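Let

\[ Y_j = f_*(X_j) + \xi_j, \qquad j = 1, \dots, n, \]

where $X, X_1, \dots, X_n$ are i.i.d. random variables in a measurable space $(S, \mathcal{A})$ with distribution $\Pi$ and $\xi, \xi_1, \dots, \xi_n$ are i.i.d. random variables with $\mathbb{E}\xi = 0$ independent of $(X_1, \dots, X_n)$. Given a dictionary $h_1, \dots, h_N : S \mapsto \mathbb{R}$, let $f_\lambda := \sum_{j=1}^N \lambda_j h_j$, $\lambda = (\lambda_1, \dots, \lambda_N) \in \mathbb{R}^N$. Given $\varepsilon > 0$, define

\[ \hat\Lambda_\varepsilon := \Bigl\{ \lambda \in \mathbb{R}^N : \max_{1 \le k \le N} \Bigl| n^{-1} \sum_{j=1}^n (f_\lambda(X_j) - Y_j) h_k(X_j) \Bigr| \le \varepsilon \Bigr\} \]

and

\[ \hat\lambda := \hat\lambda^\varepsilon \in \operatorname*{Argmin}_{\lambda \in \hat\Lambda_\varepsilon} \|\lambda\|_{\ell_1}. \]

In the case where $f_* := f_{\lambda^*}$, $\lambda^* \in \mathbb{R}^N$, Candes and Tao [Ann. Statist. 35 (2007) 2313-2351] suggested using $\hat\lambda$ as an estimator of $\lambda^*$. They called this estimator "the Dantzig selector". We study the properties of $f_{\hat\lambda}$ as an estimator of $f_*$ for regression models with random design, extending some of the results of Candes and Tao (and providing alternative proofs of these results).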

Keywords: Dantzig selector; oracle inequalities; regression; sparsity 



1. Introduction 

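Consider a regression model with random design,

\[ Y_j = f_*(X_j) + \xi_j, \qquad j = 1, \dots, n, \]

where $X, X_1, \dots, X_n$ are i.i.d. random variables in a measurable space $(S, \mathcal{A})$ with distribution $\Pi$ and $\xi, \xi_1, \dots, \xi_n$ are i.i.d. random variables with $\mathbb{E}\xi = 0$, independent of $(X_1, \dots, X_n)$ (in what follows, it will be assumed that the noise $\xi_j$ satisfies some further assumptions, such as, for instance, that $\xi$ is $N(0, \sigma^2)$).

Let $h_1, \dots, h_N$ be a dictionary consisting of $N \ge 2$ functions from $S$ into $\mathbb{R}$. Define

\[ f_\lambda := \sum_{j=1}^N \lambda_j h_j, \qquad \lambda = (\lambda_1, \dots, \lambda_N) \in \mathbb{R}^N. \]

Given $\varepsilon > 0$, define the set

\[ \hat\Lambda_\varepsilon := \Bigl\{ \lambda \in \mathbb{R}^N : \max_{1 \le k \le N} \Bigl| n^{-1} \sum_{j=1}^n (f_\lambda(X_j) - Y_j) h_k(X_j) \Bigr| \le \varepsilon \Bigr\} \]

and consider

\[ \hat\lambda := \hat\lambda^\varepsilon \in \operatorname*{Argmin}_{\lambda \in \hat\Lambda_\varepsilon} \|\lambda\|_{\ell_1}. \]

Although the set of constraints $\hat\Lambda_\varepsilon$ could be empty, we will see that for sufficiently large values of $\varepsilon$, it is non-empty with a high probability (if $\hat\Lambda_\varepsilon = \emptyset$, one can define $\hat\lambda^\varepsilon$ in an arbitrary way, for instance, $\hat\lambda^\varepsilon = 0$).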

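In the case where $f_* = f_{\lambda^*}$ for some $\lambda^* \in \mathbb{R}^N$, Candes and Tao (2007) suggested using $\hat\lambda^\varepsilon$ as an estimator of the vector of coefficients $\lambda^*$. It is easy to see that the computation of $\hat\lambda^\varepsilon$ reduces to a linear programming problem:

\[ \sum_{j=1}^N u_j \to \min, \]

subject to the constraints

\[ u_k \ge 0, \quad -u_k \le \lambda_k \le u_k, \quad -\varepsilon \le n^{-1} \sum_{j=1}^n (f_\lambda(X_j) - Y_j) h_k(X_j) \le \varepsilon, \quad k = 1, \dots, N. \]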
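The linear program for the Dantzig selector can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `dantzig_selector`, the array `H` (whose entry in row $j$, column $k$ is $h_k(X_j)$), the response vector `Y` and the use of SciPy's `linprog` solver are all assumptions made for the example.

```python
import numpy as np
from scipy.optimize import linprog


def dantzig_selector(H, Y, eps):
    """Illustrative sketch: minimize ||lam||_1 subject to
    max_k |n^{-1} sum_j (f_lam(X_j) - Y_j) h_k(X_j)| <= eps.

    H is an (n, N) array with H[j, k] = h_k(X_j); Y has length n.
    The LP variables are z = (lam, u) with -u_k <= lam_k <= u_k.
    """
    n, N = H.shape
    G = H.T @ H / n          # empirical Gram matrix of the dictionary
    c = H.T @ Y / n          # empirical correlations with the response
    I = np.eye(N)
    Z = np.zeros((N, N))
    # Rows:  lam - u <= 0,  -lam - u <= 0,  G lam <= eps + c,  -G lam <= eps - c
    A_ub = np.block([[I, -I], [-I, -I], [G, Z], [-G, Z]])
    b_ub = np.concatenate([np.zeros(2 * N), eps + c, eps - c])
    cost = np.concatenate([np.zeros(N), np.ones(N)])   # objective: sum of u_k
    bounds = [(None, None)] * N + [(0, None)] * N      # lam free, u >= 0
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    if not res.success:
        # If the solver reports failure (for example, an empty constraint set),
        # fall back to the zero vector, one of the arbitrary choices the text allows.
        return np.zeros(N)
    return res.x[:N]
```

With noiseless data `Y = H @ lam_star`, the vector `lam_star` is feasible for every `eps >= 0`, so the returned vector has $\ell_1$-norm no larger than that of `lam_star`.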

Candes and Tao called this estimator "the Dantzig selector". It is closely related to the $\ell_1$-penalization method (similar to what is called "LASSO" in statistical literature), which is based on fitting the regression model by solving the following penalized empirical risk minimization problem:

\[ n^{-1} \sum_{j=1}^n (f_\lambda(X_j) - Y_j)^2 + 2\varepsilon \|\lambda\|_{\ell_1} =: L_n(\lambda) + 2\varepsilon \|\lambda\|_{\ell_1} \to \min. \tag{1.1} \]

Note that

\[ \hat\Lambda_\varepsilon = \{ \lambda : \|\nabla L_n(\lambda)\|_{\ell_\infty} \le 2\varepsilon \} \]

and that $\lambda \in \hat\Lambda_\varepsilon$ is a necessary condition for $\lambda$ to be a solution of (1.1).



We will establish several "sparsity oracle inequalities" for the Dantzig selector that are akin to recent inequalities proved in Bunea, Tsybakov and Wegkamp (2007), van de Geer (2008) and Koltchinskii (2009) in the case of $\ell_1$- or $\ell_p$-penalized empirical risk minimization. Candes and Tao (2007) concentrated on the case of fixed design regression models, that is, when the design points $X_1, \dots, X_n$ are non-random. They proved their version of oracle inequalities under the basic assumption that the design matrix $A = (h_j(X_i))_{i=1,\dots,n;\ j=1,\dots,N}$ satisfies the so-called uniform uncertainty principle (UUP). To explain the meaning of this assumption, define

\[ J_\lambda := \operatorname{supp}(\lambda) := \{ j : \lambda_j \ne 0 \}, \qquad \lambda \in \mathbb{R}^N, \]

and set $d(\lambda) := \operatorname{card}(J_\lambda)$. Define $\delta_d(\Pi)$ to be the smallest $\delta > 0$ such that for all $\lambda \in \mathbb{R}^N$ with $d(\lambda) \le d$,

\[ (1 - \delta) \|\lambda\|_{\ell_2} \le \Bigl\| \sum_{j=1}^N \lambda_j h_j \Bigr\|_{L_2(\Pi)} \le (1 + \delta) \|\lambda\|_{\ell_2}. \]

If $\delta_d(\Pi) < 1$, then $d$-dimensional subspaces spanned on subsets of the dictionary and equipped with either the $L_2(\Pi)$-norm, or the $\ell_2$-norm on vectors of coefficients are "almost" isometric. Given the dictionary $\{h_1, \dots, h_N\}$, it is natural to call the quantity $\delta_d(\Pi)$ the restricted isometry constant of dimension $d$ with respect to measure $\Pi$. If $\Pi_n$ denotes the empirical measure based on the design points $X_1, \dots, X_n$, then the UUP essentially means that the restricted isometry constants $\delta_d(\Pi_n)$ (which are characteristics of the design matrix $A$) are sufficiently small for the values of $d$ comparable with the degree of sparsity of representation of $f_*$ in the dictionary (the number of non-zero coefficients of $\lambda^*$). Candes and Tao (2007) stated that the UUP holds with a high probability for some random design matrices such as the Gaussian ensemble (the matrix with i.i.d. standard normal entries). It is also true for the Bernoulli or Rademacher ensemble (the matrix with i.i.d. entries taking values $+1$ and $-1$ with probability $1/2$), which relies on some facts concerning random matrices that were established in other papers. In these examples, the dictionaries are orthonormal systems in the space $L_2(\Pi)$, which means that $\delta_d(\Pi) = 0$.
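For a small dictionary, the empirical restricted isometry constant $\delta_d(\Pi_n)$ can be computed by brute force, which may help to make the definition concrete. The sketch below is only an illustration under stated assumptions: the function name `restricted_isometry_constant` and the array `H` (with `H[j, k]` standing for $h_k(X_j)$) are hypothetical, and the enumeration over all supports of size $d$ is feasible only for small $N$.

```python
import itertools

import numpy as np


def restricted_isometry_constant(H, d):
    """Brute-force delta_d(Pi_n) for an (n, N) array H with H[j, k] = h_k(X_j).

    For a support J, the ratio ||f_lam||_{L2(Pi_n)} / ||lam||_{l2} ranges over
    [sqrt(min eigenvalue), sqrt(max eigenvalue)] of the Gram submatrix G[J, J].
    Supports of size exactly d suffice, since smaller supports are contained in them.
    """
    n, N = H.shape
    G = H.T @ H / n
    delta = 0.0
    for J in itertools.combinations(range(N), d):
        eig = np.linalg.eigvalsh(G[np.ix_(J, J)])   # ascending eigenvalues
        lo = np.sqrt(max(eig[0], 0.0))              # guard against tiny negative round-off
        hi = np.sqrt(max(eig[-1], 0.0))
        delta = max(delta, 1.0 - lo, hi - 1.0)
    return delta
```

For a design whose Gram matrix is the identity, the function returns zero, in agreement with the remark about orthonormal systems.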

Here, we provide more direct proofs of oracle inequalities in the random design case that do not rely on the bounds for random matrices and that apply to broader classes of design distributions, in particular, to distributions for which the dictionary is not necessarily orthonormal in $L_2(\Pi)$, but rather satisfies a restricted isometry condition with respect to $\Pi$. The next statement is a typical example of what follows from the results of Sections 2 and 3 (specifically, from Corollary 6).

Proposition 1. Suppose that the random vector $(h_1(X), \dots, h_N(X))$ has normal distribution with zero mean and that the noise $\xi$ is $N(0; \sigma^2)$. In addition, suppose that $f_* = f_{\lambda^*}$, $\lambda^* \in \mathbb{R}^N$. There then exist constants $\delta \in (0, 1)$ and $C, D > 0$ with the following property. For an arbitrary $A \ge 1$, denote by $\bar d$ the largest $d \le N/e - 1$ such that

\[ \delta_{3d}(\Pi) \le \delta \]

and

\[ C \sqrt{\frac{A d \log(N/d)}{n}} \le 1/4. \]

Then, for all

\[ \varepsilon \ge C \sigma \max_{1 \le k \le N} \|h_k\|_{L_2(\Pi)} \sqrt{\frac{A \log N}{n}}, \]

the condition $d(\lambda^*) \le \bar d$ implies that with probability at least $1 - N^{-A}$,

\[ \|\hat\lambda - \lambda^*\|_{\ell_2} \le D \sqrt{d(\lambda^*)}\, \varepsilon. \]

Our approach is based on some facts concerning empirical and Rademacher processes and it is close to the approach taken by Rudelson and Vershynin (2005) or Mendelson, Pajor and Tomczak-Jaegermann (2007). At the same time, it relies only on rather elementary tools (symmetrization and contraction inequalities for Rademacher processes and Bernstein-type exponential bounds) and does not use more advanced techniques, such as concentration of measure and generic chaining, which are used in the papers cited above. It is worth mentioning that Koltchinskii (2005, 2009) showed that if, in (1.1), one uses $\|\lambda\|_{\ell_p}^p$ with $p = 1+$ instead of $\|\lambda\|_{\ell_1}$, then one can establish a version of sparsity oracle inequalities without making strong assumptions on the dictionary such as a restricted isometry condition.

In the next section, we introduce some geometric characteristics of the dictionary that 
are of importance in analysis of sparse recovery problems, and we prove general oracle 
inequalities for the Dantzig selector in terms of these characteristics. Several corollaries, 
more special results and some examples are given in Section 3. Finally, the Appendix 
contains some exponential bounds for Rademacher processes needed in the proofs of the 
main results. 



2. Main results 

In what follows, we frequently use the Orlicz norm $\|\cdot\|_\psi$ for random variables, most often with $\psi = \psi_1$, $\psi_1(x) := e^{x} - 1$, or $\psi = \psi_2$, $\psi_2(x) := e^{x^2} - 1$. For any convex non-decreasing function $\psi : \mathbb{R}_+ \mapsto \mathbb{R}_+$ with $\psi(0) = 0$, it is defined as

\[ \|\eta\|_\psi := \inf\Bigl\{ C > 0 : \mathbb{E}\psi\Bigl(\frac{|\eta|}{C}\Bigr) \le 1 \Bigr\} \]

(see Ledoux and Talagrand (1991), van der Vaart and Wellner (1996), de la Peña and Giné (1998)).
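To make the definition of the Orlicz norm concrete, the following sketch computes it for the empirical distribution of a finite sample by bisection on $C$. The names `orlicz_norm`, `psi1` and `psi2` are illustrative assumptions, and the sample version only stands in for the expectation in the definition.

```python
import math


def _safe_exp(t):
    # math.exp overflows for large arguments; treat those values as +infinity.
    return math.inf if t > 700.0 else math.exp(t)


def psi1(t):
    return _safe_exp(t) - 1.0


def psi2(t):
    return _safe_exp(t * t) - 1.0


def orlicz_norm(sample, psi, lo=1e-6, hi=1e6, iters=200):
    """Smallest C in [lo, hi] with mean(psi(|x| / C)) <= 1, by geometric bisection.

    Assumes the sample is not identically zero and that the answer lies in [lo, hi].
    """
    def mean_psi(c):
        return sum(psi(abs(x) / c) for x in sample) / len(sample)

    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if mean_psi(mid) <= 1.0:
            hi = mid      # mid is admissible, so the infimum is at most mid
        else:
            lo = mid      # mid is too small
    return hi
```

For a sample consisting of the single value 1, the condition $e^{1/C^2} - 1 \le 1$ gives the $\psi_2$-norm $1/\sqrt{\log 2}$, and $e^{1/C} - 1 \le 1$ gives the $\psi_1$-norm $1/\log 2$.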

For $J \subset \{1, \dots, N\}$, let $d(J) := \operatorname{card}(J)$. Define

\[ C_J := \Bigl\{ u \in \mathbb{R}^N : \sum_{j \notin J} |u_j| \le \sum_{j \in J} |u_j| \Bigr\}. \]

The set $C_J$ is a cone in $\mathbb{R}^N$ (that is, $u \in C_J$ implies that $au \in C_J$ for all $a > 0$). It consists of vectors $u \in \mathbb{R}^N$ such that the coordinates of $u$ in the set $J$ are dominant. Such cones of dominant coordinates play an important role in the analysis of the Dantzig selector, LASSO and other sparse recovery methods. The reason is that, for a "sparse" feasible vector $\lambda \in \hat\Lambda_\varepsilon$, the definition of the Dantzig selector $\hat\lambda^\varepsilon$ means that $\|\hat\lambda^\varepsilon\|_{\ell_1} \le \|\lambda\|_{\ell_1}$, which implies that

\[ \sum_{j \notin J_\lambda} |\hat\lambda_j - \lambda_j| = \sum_{j \notin J_\lambda} |\hat\lambda_j| \le \sum_{j \in J_\lambda} (|\lambda_j| - |\hat\lambda_j|) \le \sum_{j \in J_\lambda} |\lambda_j - \hat\lambda_j| \tag{2.1} \]

and, hence, $\hat\lambda - \lambda \in C_{J_\lambda}$. The proofs of various bounds on the norms of the vector $\hat\lambda^\varepsilon - \lambda$ and on the norms of the corresponding function

\[ f_{\hat\lambda^\varepsilon} - f_\lambda \in \mathrm{l.s.}(\{h_1, \dots, h_N\}) \]

are usually based on the comparison of these norms on the cone $C_{J_\lambda}$. We will introduce several geometric characteristics of the dictionary that are needed for such a comparison.
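The cone condition behind (2.1) is easy to check numerically. The following sketch is only an illustration: the helper `in_cone` and the two example vectors are hypothetical and are chosen so that the second vector has the smaller $\ell_1$-norm.

```python
def in_cone(u, J):
    """Return True if u lies in C_J, i.e. the l1 mass of u off J is at most its l1 mass on J."""
    J = set(J)
    on_support = sum(abs(x) for j, x in enumerate(u) if j in J)
    off_support = sum(abs(x) for j, x in enumerate(u) if j not in J)
    return off_support <= on_support


# A sparse vector supported on J = {0, 1} and a competitor with smaller l1 norm.
lam = [1.0, -2.0, 0.0, 0.0]
lam_hat = [0.5, -1.5, 0.3, -0.2]
diff = [a - b for a, b in zip(lam_hat, lam)]
```

Here `sum(abs(x) for x in lam_hat)` is 2.5, which is below the $\ell_1$-norm 3 of `lam`, so the difference `diff` lies in the cone $C_J$ with $J = \{0, 1\}$, as (2.1) predicts.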
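Define

\[ \beta(J) := \beta(J; \Pi) := \inf\Bigl\{ \beta > 0 : \forall \lambda \in C_J,\ \sum_{j \in J} |\lambda_j| \le \beta \Bigl\| \sum_{j=1}^N \lambda_j h_j \Bigr\|_{L_1(\Pi)} \Bigr\} \]

(set $\beta(J) = 0$ if $J = \emptyset$). Note that if $J \ne \emptyset$ and the functions $h_1, \dots, h_N$ in the dictionary are linearly independent in $L_1(\Pi)$, then $\beta(J) < +\infty$.
Another quantity of interest is

\[ \beta_2(J) := \beta_2(J; \Pi) := \inf\Bigl\{ \beta > 0 : \forall \lambda \in C_J,\ \sum_{j \in J} |\lambda_j|^2 \le \beta^2 \Bigl\| \sum_{j=1}^N \lambda_j h_j \Bigr\|_{L_2(\Pi)}^2 \Bigr\}. \]

Note that $\beta_2(J) = 1$, $J \ne \emptyset$, if the dictionary $\{h_1, \dots, h_N\}$ is orthonormal. In Section 3, the connection of these quantities to restricted isometry constants $\delta_d(\Pi)$ is discussed. In particular, it can be shown that if $\delta_{3d}(\Pi)$ is small enough, then, for all sets $J$ of cardinality $d$, $\beta_2(J)$ remains properly bounded.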

The following condition on the dictionary and on the distribution $\Pi$ is often of interest: for all $\lambda \in C_J$,

\[ \Bigl\| \sum_{j=1}^N \lambda_j h_j \Bigr\|_{L_1(\Pi)} \le \Bigl\| \sum_{j=1}^N \lambda_j h_j \Bigr\|_{L_2(\Pi)} \le B(J) \Bigl\| \sum_{j=1}^N \lambda_j h_j \Bigr\|_{L_1(\Pi)} \tag{2.2} \]

with some constant $B(J) > 0$.

Note that the first inequality in (2.2) is trivial for all $\lambda \in \mathbb{R}^N$. The second (non-trivial) bound holds for all $\lambda \in \mathbb{R}^N$, with a constant $B > 0$ that does not depend on $N$ or on the set $J$, in several interesting, but rather special, examples. In particular, this condition holds when $(h_1(X), \dots, h_N(X))$ has a mean zero normal distribution in $\mathbb{R}^N$ (for instance, if $h_1(X), \dots, h_N(X)$ are i.i.d. standard normal, which is the case for the Gaussian dictionary) or when $h_1(X), \dots, h_N(X)$ are i.i.d. Rademacher random variables (that is, $h_j(X)$ is $+1$ or $-1$ with probability $1/2$ each; this is the case for Bernoulli and Rademacher dictionaries). In the last case, (2.2) holds by the Khinchine inequality. For Gaussian and Bernoulli dictionaries, all $L_p$-norms, $p \ge 1$, and even $\psi_1$- and $\psi_2$-norms of $\sum_{j=1}^N \lambda_j h_j$ are equivalent up to numerical constants (see Bobkov and Houdre (1997) for a discussion of more general Khinchine-type inequalities and their connections with isoperimetric constants). In general, the constant $B$ might depend on $N$ and, since this constant is involved in the bounds on the performance of the Dantzig selector, it is of some importance that condition (2.2) is supposed to hold only for all $\lambda \in C_J$ (rather than for all $\lambda \in \mathbb{R}^N$), and this condition is usually needed for a small set $J$.

Under the condition (2.2), the following bound is straightforward:

\[ \beta(J) \le B(J) \beta_2(J) \sqrt{d(J)}. \tag{2.3} \]

If $\beta_2(J)$ is bounded by a small constant (as in the case of orthonormal dictionaries), then $\beta(J)$ is "small" for sets $J$ of small cardinality $d(J)$.
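The bound (2.3) can be verified in one line. The following is a short sketch that uses only the Cauchy-Schwarz inequality, the definition of $\beta_2(J)$ and the second inequality in (2.2); for any $\lambda \in C_J$,

```latex
\[
\sum_{j \in J} |\lambda_j|
\le \sqrt{d(J)} \Bigl( \sum_{j \in J} |\lambda_j|^2 \Bigr)^{1/2}
\le \sqrt{d(J)}\, \beta_2(J)\, \|f_\lambda\|_{L_2(\Pi)}
\le \sqrt{d(J)}\, \beta_2(J)\, B(J)\, \|f_\lambda\|_{L_1(\Pi)} .
\]
```

Since $\beta(J)$ is the smallest constant $\beta$ for which $\sum_{j \in J} |\lambda_j| \le \beta \|f_\lambda\|_{L_1(\Pi)}$ holds on the cone $C_J$, the chain above gives $\beta(J) \le B(J) \beta_2(J) \sqrt{d(J)}$.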

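Recall the notation $J_\lambda := \operatorname{supp}(\lambda)$ and also recall that $d(\lambda) := d(J_\lambda)$.

We will fix the values of $\varepsilon > 0$, $A > 0$ and $C > 0$, assume that

\[ \frac{A \log N}{n} \le 1 \]

and define the following set:

\[ \Lambda_\varepsilon(A) := \Bigl\{ \lambda \in \mathbb{R}^N : |\langle f_\lambda - f_*, h_k \rangle_{L_2(\Pi)}| + C \bigl( \|(f_\lambda - f_*)(X) h_k(X)\|_{\psi_1} + \|\xi h_k(X)\|_{\psi_1} \bigr) \sqrt{\frac{A \log N}{n}} \le \varepsilon,\ k = 1, \dots, N \Bigr\} \]

(recall that $\xi$ involved in the above definition is the noise of the regression model). Under the condition

\[ \varepsilon \ge C \max_{1 \le k \le N} \|\xi h_k(X)\|_{\psi_1} \sqrt{\frac{A \log N}{n}}, \]

which is necessary for the set $\Lambda_\varepsilon(A)$ to be non-empty, this set consists of vectors $\lambda$ such that $f_\lambda$ is, in a certain sense, a good approximation of $f_*$. The condition

\[ \max_{1 \le k \le N} |\langle f_\lambda - f_*, h_k \rangle_{L_2(\Pi)}| \le \varepsilon \tag{2.4} \]

that follows from $\lambda \in \Lambda_\varepsilon(A)$ essentially means that $f_\lambda - f_*$ is almost ("up to $\varepsilon$") orthogonal to the linear span of the dictionary, so $f_\lambda$ should be close to the projection of $f_*$ on the linear span. In fact, (2.4) is a necessary condition of the minimum in the convex minimization problem

\[ \|f_\lambda - f_*\|_{L_2(\Pi)}^2 + 2\varepsilon \|\lambda\|_{\ell_1} \to \min, \]

which can be viewed as a distribution-dependent version of the empirical risk minimization (1.1) (recall that $\lambda \in \hat\Lambda_\varepsilon$ is a necessary condition for (1.1)).

If $f_{\lambda^\circ}$, $\lambda^\circ \in \mathbb{R}^N$, is the orthogonal projection in $L_2(\Pi)$ of the function $f_*$ onto the linear span of the dictionary, then it is obvious that the condition

\[ \varepsilon \ge C \max_{1 \le k \le N} \bigl( \|(f_{\lambda^\circ} - f_*)(X) h_k(X)\|_{\psi_1} + \|\xi h_k(X)\|_{\psi_1} \bigr) \sqrt{\frac{A \log N}{n}} \]

is sufficient for $\Lambda_\varepsilon(A) \ne \emptyset$ (since, under this condition, $\lambda^\circ \in \Lambda_\varepsilon(A)$).

The next proposition shows that if $\Lambda_\varepsilon(A) \ne \emptyset$, then, with a high probability, $\hat\Lambda_\varepsilon \ne \emptyset$.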

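Proposition 2. Suppose that $\Lambda_\varepsilon(A) \ne \emptyset$ and $\lambda \in \Lambda_\varepsilon(A)$. Then, with probability at least $1 - 2N^{-A}$, $\lambda \in \hat\Lambda_\varepsilon$.

Proof. Indeed, for any such $\lambda$, we have

\[
\Bigl| n^{-1} \sum_{j=1}^n (f_\lambda(X_j) - Y_j) h_k(X_j) \Bigr|
\le |\langle f_\lambda - f_*, h_k \rangle_{L_2(\Pi)}|
+ \Bigl| n^{-1} \sum_{j=1}^n \bigl[ (f_\lambda(X_j) - f_*(X_j)) h_k(X_j) - \mathbb{E}(f_\lambda(X) - f_*(X)) h_k(X) \bigr] \Bigr|
+ \Bigl| n^{-1} \sum_{j=1}^n \xi_j h_k(X_j) \Bigr|.
\]

Applying Lemma 3 from the Appendix to the second and third terms yields, with probability at least $1 - 2N^{-A}$:

\[
\max_{1 \le k \le N} \Bigl| n^{-1} \sum_{j=1}^n (f_\lambda(X_j) - Y_j) h_k(X_j) \Bigr|
\le \max_{1 \le k \le N} \Bigl[ |\langle f_\lambda - f_*, h_k \rangle_{L_2(\Pi)}| + C \bigl( \|(f_\lambda - f_*)(X) h_k(X)\|_{\psi_1} + \|\xi h_k(X)\|_{\psi_1} \bigr) \sqrt{\frac{A \log N}{n}} \Bigr]
\le \varepsilon
\]

by the definition of the set $\Lambda_\varepsilon(A)$. □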
Also, define

\[ \Lambda_s(A) := \Bigl\{ \lambda \in \mathbb{R}^N : C \beta(J_\lambda) \max_{1 \le k \le N} \|h_k(X)\|_{\psi_1} \sqrt{\frac{A \log N}{n}} \le 1/4 \Bigr\}. \]

We will interpret $\Lambda_s(A)$ as a set of "sparse" vectors since, in view of the bound (2.3), $\beta(J_\lambda)$ has some connection to the sparsity of $\lambda$. Of course, the fact that $\beta(J_\lambda)$ is not too large is also related to the properties of the dictionary. For dictionaries that are close to being orthonormal, $\Lambda_s(A)$ would include sparse enough vectors in the usual sense.

Essentially, the bounds of Theorems 1 and 2 below show that if there exists a vector $\lambda$ in $\hat\Lambda_\varepsilon$ (the set of constraints of the Dantzig selector) that is sufficiently "sparse", then the Dantzig selector will be in a small ball around $\lambda$ in such norms as $\|\cdot\|_{\ell_1}$ and $\|\cdot\|_{\ell_2}$, or will be in a small ball around $f_\lambda$ in such norms as $\|\cdot\|_{L_1(\Pi)}$ and $\|\cdot\|_{L_2(\Pi)}$. The radius of this ball crucially depends on the degree of sparsity of $\lambda$ and also on the "well-posedness" of the dictionary characterized by such quantities as $\beta_2$ (see also Section 3 for a discussion of the connection of these quantities to restricted isometry properties of the dictionary). The bounds also imply that the Dantzig selector is adaptive to an unknown degree of the sparsity of the problem (at least in the case when the dictionary is not very far from being orthonormal in $L_2(\Pi)$).

Let

\[ \tilde\Lambda_\varepsilon(A) := \Lambda_\varepsilon(A) \cap \Lambda_s(A). \]

This set will be interpreted in the next theorem as a set of oracle vectors and it will be assumed that $\tilde\Lambda_\varepsilon(A) \ne \emptyset$. In particular, it means that $\varepsilon$ must satisfy

\[ \varepsilon \ge C \max_{1 \le k \le N} \|\xi h_k(X)\|_{\psi_1} \sqrt{\frac{A \log N}{n}} \]

(which, of course, requires that $\|\xi h_k(X)\|_{\psi_1} < +\infty$). The fact that $\lambda \in \tilde\Lambda_\varepsilon(A)$ implies that $\lambda$ is sparse in the sense that $\lambda \in \Lambda_s(A)$ and, at the same time, that $f_\lambda$ provides a reasonably good approximation of $f_*$ in the sense that both (2.4) holds and

\[ C \|(f_\lambda - f_*)(X) h_k(X)\|_{\psi_1} \sqrt{\frac{A \log N}{n}} \le \varepsilon. \]

If $\tilde\Lambda_\varepsilon(A) = \emptyset$, then there are no sparse vectors $\lambda$ for which $f_\lambda$ approximates $f_*$ well, so, from this point of view, the problem is not sparse.
First, we prove the following result.
First, we prove the following result. 



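Theorem 1. There exists a constant $C$ in the definitions of $\Lambda_\varepsilon(A)$, $\Lambda_s(A)$ such that for all $A \ge 1$ with probability at least $1 - N^{-A}$, the following bounds hold for all $\lambda \in \hat\Lambda_\varepsilon \cap \Lambda_s(A)$ and for the Dantzig selector $\hat\lambda$:

\[ \|f_{\hat\lambda} - f_\lambda\|_{L_1(\Pi)} \le 16 \beta(J_\lambda) \varepsilon \]

and

\[ \|\hat\lambda - \lambda\|_{\ell_1} \le 32 \beta^2(J_\lambda) \varepsilon. \]

Under the assumption that $\tilde\Lambda_\varepsilon(A) \ne \emptyset$, with the same probability,

\[ \|f_{\hat\lambda} - f_*\|_{L_1(\Pi)} \le \inf_{\lambda \in \tilde\Lambda_\varepsilon(A)} \bigl[ \|f_\lambda - f_*\|_{L_1(\Pi)} + 16 \beta(J_\lambda) \varepsilon \bigr], \]

and if, in addition, $f_* = f_{\lambda^*}$, $\lambda^* \in \mathbb{R}^N$, then we also have

\[ \|\hat\lambda - \lambda^*\|_{\ell_1} \le \inf_{\lambda \in \tilde\Lambda_\varepsilon(A)} \bigl[ \|\lambda - \lambda^*\|_{\ell_1} + 32 \beta^2(J_\lambda) \varepsilon \bigr]. \]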



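Proof. Suppose that $\lambda \in \hat\Lambda_\varepsilon \cap \Lambda_s(A)$. The proof of the first two bounds will be based on upper bounding $\|\hat\lambda - \lambda\|_{\ell_1}$ in terms of $\|f_{\hat\lambda} - f_\lambda\|_{L_1(\Pi)}$ and vice versa. Combining these bounds yields inequalities on both of the norms that can be solved, leading to the first two bounds of the theorem.

Since $\lambda \in \hat\Lambda_\varepsilon$, (2.1) implies that $\hat\lambda - \lambda \in C_{J_\lambda}$ and

\[ \|\hat\lambda - \lambda\|_{\ell_1} \le \sum_{j \notin J_\lambda} |\hat\lambda_j| + \sum_{j \in J_\lambda} |\hat\lambda_j - \lambda_j| \le 2 \sum_{j \in J_\lambda} |\hat\lambda_j - \lambda_j| \le 2 \beta(J_\lambda) \|f_{\hat\lambda} - f_\lambda\|_{L_1(\Pi)}. \tag{2.5} \]

We will now upper bound $\|f_{\hat\lambda} - f_\lambda\|_{L_1(\Pi)}$ in terms of $\|\hat\lambda - \lambda\|_{\ell_1}$, which will imply the result. We start with the following, obvious, bound (we use the notation $\nu(f) := \int f \, d\nu$):

\[
\|f_{\hat\lambda} - f_\lambda\|_{L_1(\Pi)} = \|f_{\hat\lambda} - f_\lambda\|_{L_1(\Pi_n)} + (\Pi - \Pi_n)(|f_{\hat\lambda} - f_\lambda|)
\le \|f_{\hat\lambda} - f_\lambda\|_{L_1(\Pi_n)} + \sup_{\|u\|_{\ell_1} \le 1} |(\Pi_n - \Pi)(|f_u|)| \, \|\hat\lambda - \lambda\|_{\ell_1}. \tag{2.6}
\]

We will separately bound the first and second terms of this bound. First, note that

\[
\|f_{\hat\lambda} - f_\lambda\|_{L_1(\Pi_n)}^2 \le \|f_{\hat\lambda} - f_\lambda\|_{L_2(\Pi_n)}^2 = \langle f_{\hat\lambda} - f_\lambda, f_{\hat\lambda} - f_\lambda \rangle_{L_2(\Pi_n)}
= \sum_{k=1}^N (\hat\lambda_k - \lambda_k) \langle f_{\hat\lambda} - f_\lambda, h_k \rangle_{L_2(\Pi_n)}
\le \|\hat\lambda - \lambda\|_{\ell_1} \max_{1 \le k \le N} |\langle f_{\hat\lambda} - f_\lambda, h_k \rangle_{L_2(\Pi_n)}|.
\]

Since both $\hat\lambda \in \hat\Lambda_\varepsilon$ and $\lambda \in \hat\Lambda_\varepsilon$, we have

\[
\max_{1 \le k \le N} |\langle f_{\hat\lambda} - f_\lambda, h_k \rangle_{L_2(\Pi_n)}|
\le \max_{1 \le k \le N} \Bigl| n^{-1} \sum_{j=1}^n (f_{\hat\lambda}(X_j) - Y_j) h_k(X_j) \Bigr|
+ \max_{1 \le k \le N} \Bigl| n^{-1} \sum_{j=1}^n (f_\lambda(X_j) - Y_j) h_k(X_j) \Bigr|
\le 2\varepsilon,
\]

which implies that

\[ \|f_{\hat\lambda} - f_\lambda\|_{L_1(\Pi_n)} \le \sqrt{2\varepsilon \|\hat\lambda - \lambda\|_{\ell_1}}. \]

By Lemma 4 from the Appendix, with probability at least $1 - N^{-A}$ (under the assumption $A \log N \le n$),

\[ \sup_{\|u\|_{\ell_1} \le 1} |(\Pi_n - \Pi)(|f_u|)| \le C \max_{1 \le k \le N} \|h_k\|_{\psi_1} \sqrt{\frac{A \log N}{n}}. \]

This yields the following bound (with probability at least $1 - N^{-A}$):

\[ \|f_{\hat\lambda} - f_\lambda\|_{L_1(\Pi)} \le \sqrt{2\varepsilon \|\hat\lambda - \lambda\|_{\ell_1}} + C \max_{1 \le k \le N} \|h_k\|_{\psi_1} \sqrt{\frac{A \log N}{n}} \, \|\hat\lambda - \lambda\|_{\ell_1}. \tag{2.7} \]

Together with (2.5), this implies that

\[
\|f_{\hat\lambda} - f_\lambda\|_{L_1(\Pi)} \le \sqrt{4\varepsilon \beta(J_\lambda) \|f_{\hat\lambda} - f_\lambda\|_{L_1(\Pi)}}
+ 2C \max_{1 \le k \le N} \|h_k\|_{\psi_1} \sqrt{\frac{A \log N}{n}} \, \beta(J_\lambda) \|f_{\hat\lambda} - f_\lambda\|_{L_1(\Pi)}.
\]

Recalling the definition of the set $\Lambda_s(A)$, we can guarantee that

\[ 2C \max_{1 \le k \le N} \|h_k\|_{\psi_1} \sqrt{\frac{A \log N}{n}} \, \beta(J_\lambda) \le 1/2, \]

which implies that

\[ \|f_{\hat\lambda} - f_\lambda\|_{L_1(\Pi)} \le 2 \sqrt{4\varepsilon \beta(J_\lambda) \|f_{\hat\lambda} - f_\lambda\|_{L_1(\Pi)}}, \]

and the first bound now follows. The second bound is also true, in view of (2.5).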
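The last step can be spelled out as follows; this is only a sketch of the elementary algebra, with the shorthand $x := \|f_{\hat\lambda} - f_\lambda\|_{L_1(\Pi)}$ and $\beta := \beta(J_\lambda)$ introduced here for brevity:

```latex
\[
x \le 2\sqrt{4\varepsilon\beta x}
\;\Longrightarrow\;
x^2 \le 16\,\varepsilon\beta\,x
\;\Longrightarrow\;
x \le 16\,\beta\varepsilon,
\qquad\text{and then, by (2.5),}\qquad
\|\hat\lambda - \lambda\|_{\ell_1} \le 2\beta x \le 32\,\beta^2\varepsilon .
\]
```

The case $x = 0$ is trivial, so dividing by $x$ in the second implication is harmless.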

To prove each of the remaining bounds, define $\lambda$ to be the vector for which the infimum in the right-hand side of the bound is attained. By Proposition 2, with probability at least $1 - 2N^{-A}$, we have $\lambda \in \hat\Lambda_\varepsilon \cap \Lambda_s(A)$. Therefore, we can use the first two bounds of the theorem and the triangle inequality to complete the proof of the remaining bounds that now hold with probability at least $1 - 3N^{-A}$.

It only remains to show that by adjusting the value of the constant $C$ in the definitions of the sets $\Lambda_\varepsilon(A)$ and $\Lambda_s(A)$, it is possible to ensure that the bounds hold with probability at least $1 - N^{-A}$, as was claimed. To this end, check that for $c := \log_2 3 + 1$ and all $A \ge c$, $N \ge 2$,

\[ 3N^{-A} \le N^{-A/c}. \]

Now, take $A' = A/c \ge 1$ and replace the constant $C$ with $C\sqrt{c}$ to show that the bounds hold with probability at least $1 - N^{-A'}$. □

Under the condition (2.2), the bound (2.3) holds and one can derive from Theorem 1 the bounds expressed in terms of quantity $\beta_2$, namely, replacing $\beta(J_\lambda)$ in the inequalities of Theorem 1 by the upper bound $B(J_\lambda) \beta_2(J_\lambda) \sqrt{d(\lambda)}$. However, below, we will give another version of such a statement with bounds on the norms $\|\cdot\|_{L_2(\Pi)}$ and $\|\cdot\|_{\ell_2}$, and with a slight improvement of the logarithmic factor in the definition of the set of sparse vectors $\Lambda_s(A)$.

Define (with a minor abuse of notation)

\[ \beta_2(d) := \beta_2(d; \Pi) := \max\{ \beta_2(J) : J \subset \{1, \dots, N\},\ d(J) \le 2d \} \]

and

\[ B(d) := \max\{ B(J) : J \subset \{1, \dots, N\},\ d(J) \le d \}. \]

Let $\bar d$ denote the largest $d \le \frac{N}{e} - 1$ such that

\[ \frac{A d \log(N/d)}{n} \le 1 \]

and

\[ C B(d) \beta_2(d) \sup_{\|u\|_{\ell_2} \le 1,\ d(u) \le d} \|f_u\|_{\psi_1} \sqrt{\frac{A d \log(N/d)}{n}} \le 1/4. \]

We redefine the set of "sparse" vectors as follows:

\[ \Lambda_{s,2}(A) := \{ \lambda \in \mathbb{R}^N : d(\lambda) \le \bar d \}. \]

Let

\[ \Lambda_\varepsilon^2(A) := \Lambda_\varepsilon(A) \cap \Lambda_{s,2}(A), \]

which will now play the role of an oracle set (that is, the set of sparse enough vectors that approximate the target function $f_*$ reasonably well).
We will use the notation

\[ \binom{N}{\le d} := \sum_{k=0}^{d} \binom{N}{k}. \]

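Theorem 2. Suppose that condition (2.2) holds. There exists a constant $C$ in the definitions of $\Lambda_\varepsilon(A)$ and $\Lambda_{s,2}(A)$ such that for all $A \ge 1$, with probability at least

\[ 1 - 5^{-\bar d A} \binom{N}{\le \bar d}^{-A}, \]

the following bounds hold: for all $\lambda \in \hat\Lambda_\varepsilon \cap \Lambda_{s,2}(A)$,

\[ \|f_{\hat\lambda} - f_\lambda\|_{L_2(\Pi)} \le 16 B^2(d(\lambda)) \beta_2(d(\lambda)) \sqrt{d(\lambda)}\, \varepsilon \]

and

\[ \|\hat\lambda - \lambda\|_{\ell_2} \le 32 B^2(d(\lambda)) \beta_2^2(d(\lambda)) \sqrt{d(\lambda)}\, \varepsilon. \]

Suppose that $\Lambda_\varepsilon^2(A) \ne \emptyset$. Then, with probability at least $1 - N^{-A}$,

\[ \|f_{\hat\lambda} - f_*\|_{L_2(\Pi)} \le \inf_{\lambda \in \Lambda_\varepsilon^2(A)} \bigl[ \|f_\lambda - f_*\|_{L_2(\Pi)} + 16 B^2(d(\lambda)) \beta_2(d(\lambda)) \sqrt{d(\lambda)}\, \varepsilon \bigr]. \]

Moreover, if, in addition, $f_* = f_{\lambda^*}$, $\lambda^* \in \mathbb{R}^N$, then, also with the same probability,

\[ \|\hat\lambda - \lambda^*\|_{\ell_2} \le \inf_{\lambda \in \Lambda_\varepsilon^2(A)} \bigl[ \|\lambda - \lambda^*\|_{\ell_2} + 32 B^2(d(\lambda)) \beta_2^2(d(\lambda)) \sqrt{d(\lambda)}\, \varepsilon \bigr]. \]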

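We will need the following well-known fact (see Candes and Tao (2005), proof of their Theorem 1).

Lemma 1. Let $u \in C_J$. Define $J_0 := J$ and let $J_1$ be the set of $d$ coordinates in $\{1, \dots, N\} \setminus J_0$ for which the $|u_j|$'s are the largest, $J_2$ be the set of $d$ coordinates in $\{1, \dots, N\} \setminus (J_0 \cup J_1)$ for which the $|u_j|$'s are the largest, etcetera. Define $u^{(k)} := (u_j : j \in J_k)$. Then, $u = \sum_{k \ge 0} u^{(k)}$ and

\[ \sum_{k \ge 2} \|u^{(k)}\|_{\ell_2} \le \Bigl( \sum_{j \in J} |u_j|^2 \Bigr)^{1/2}, \]

which also implies that

\[ \|u\|_{\ell_2} \le 2 \Bigl( \sum_{j \in J_0 \cup J_1} |u_j|^2 \Bigr)^{1/2}. \]

Proof. For all $j \in J_{k+1}$,

\[ |u_j| \le \frac{1}{d} \sum_{i \in J_k} |u_i|, \]

implying that

\[ \Bigl( \sum_{j \in J_{k+1}} |u_j|^2 \Bigr)^{1/2} \le \frac{1}{\sqrt{d}} \sum_{j \in J_k} |u_j|. \]

Adding these inequalities for $k = 1, 2, \dots$ yields

\[ \sum_{k \ge 2} \|u^{(k)}\|_{\ell_2} \le \frac{1}{\sqrt{d}} \sum_{j \notin J} |u_j|. \]

Thus, for $u \in C_J$,

\[ \sum_{k \ge 2} \|u^{(k)}\|_{\ell_2} \le \frac{1}{\sqrt{d}} \sum_{j \in J} |u_j| \le \Bigl( \sum_{j \in J} |u_j|^2 \Bigr)^{1/2} \]

and

\[ \|u\|_{\ell_2} \le \Bigl( \sum_{j \in J_0 \cup J_1} |u_j|^2 \Bigr)^{1/2} + \sum_{k \ge 2} \|u^{(k)}\|_{\ell_2} \le 2 \Bigl( \sum_{j \in J_0 \cup J_1} |u_j|^2 \Bigr)^{1/2}. \]

□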
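The block construction of Lemma 1 can be tried out numerically. The sketch below is an illustration under assumptions: it takes $d$ to be the cardinality of $J$, and the helper names `block_decomposition` and `lemma1_quantities` as well as the example vector are hypothetical.

```python
import numpy as np


def block_decomposition(u, J):
    """Split the coordinates into J_0 = J and blocks J_1, J_2, ... of size d = len(J),
    taken from the remaining coordinates in decreasing order of |u_j|."""
    u = np.asarray(u, dtype=float)
    J = list(J)
    d = len(J)
    in_J = set(J)
    rest = [j for j in range(len(u)) if j not in in_J]
    rest.sort(key=lambda j: -abs(u[j]))
    return [J] + [rest[i:i + d] for i in range(0, len(rest), d)]


def lemma1_quantities(u, J):
    """Return (sum_{k>=2} ||u^(k)||_2, ||u restricted to J||_2,
    ||u restricted to J_0 and J_1||_2, ||u||_2)."""
    u = np.asarray(u, dtype=float)
    blocks = block_decomposition(u, J)
    tail = sum(np.linalg.norm(u[b]) for b in blocks[2:])
    on_J = np.linalg.norm(u[blocks[0]])
    head_idx = blocks[0] + (blocks[1] if len(blocks) > 1 else [])
    head = np.linalg.norm(u[head_idx])
    return tail, on_J, head, np.linalg.norm(u)


# Example vector in the cone C_J with J = {0, 1}: off-support l1 mass 2.6 <= 5.
u_example = [3.0, -2.0, 1.0, 0.5, -0.5, 0.25, 0.25, 0.1]
tail, on_J, head, total = lemma1_quantities(u_example, [0, 1])
```

For this vector, `tail <= on_J` and `total <= 2 * head`, which are the two inequalities stated in the lemma.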



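Proof of Theorem 2. This is a straightforward modification of the proof of Theorem 1. Suppose that $\lambda \in \hat\Lambda_\varepsilon \cap \Lambda_{s,2}(A)$. Instead of (2.6), we now use

\[
\|f_{\hat\lambda} - f_\lambda\|_{L_1(\Pi)} = \|f_{\hat\lambda} - f_\lambda\|_{L_1(\Pi_n)} + (\Pi - \Pi_n)(|f_{\hat\lambda} - f_\lambda|)
\le \|f_{\hat\lambda} - f_\lambda\|_{L_1(\Pi_n)} + \sup_{\|u\|_{\ell_2} \le 1,\ u \in C_{J_\lambda}} |(\Pi_n - \Pi)(|f_u|)| \, \|\hat\lambda - \lambda\|_{\ell_2}. \tag{2.8}
\]

To bound $\|\hat\lambda - \lambda\|_{\ell_2}$, we observe (as in the proof of Theorem 1) that $\hat\lambda - \lambda \in C_{J_\lambda}$ and apply Lemma 1 to $u = \hat\lambda - \lambda$, $J = J_\lambda$:

\[ \|\hat\lambda - \lambda\|_{\ell_2} \le 2 \Bigl( \sum_{j \in J_0 \cup J_1} |\hat\lambda_j - \lambda_j|^2 \Bigr)^{1/2} \le 2 \beta_2(d(\lambda)) \|f_{\hat\lambda} - f_\lambda\|_{L_2(\Pi)}. \tag{2.9} \]

We will also use Lemma 6 to bound

\[ \sup_{\|u\|_{\ell_2} \le 1,\ u \in C_{J_\lambda}} |(\Pi_n - \Pi)(|f_u|)| \le C \sup_{\|u\|_{\ell_2} \le 1,\ d(u) \le \bar d} \|f_u\|_{\psi_1} \sqrt{\frac{A \bar d \log(N/\bar d)}{n}}, \tag{2.10} \]

which holds with probability at least

\[ 1 - 5^{-\bar d A} \binom{N}{\le \bar d}^{-A}. \]

The first term in the right-hand side of (2.8) is bounded as in the proof of Theorem 1,

\[ \|f_{\hat\lambda} - f_\lambda\|_{L_1(\Pi_n)} \le \sqrt{2\varepsilon \|\hat\lambda - \lambda\|_{\ell_1}}, \tag{2.11} \]

and we then use

\[ \|\hat\lambda - \lambda\|_{\ell_1} \le 2 \sum_{j \in J_\lambda} |\hat\lambda_j - \lambda_j| \le 2 \sqrt{d(\lambda)} \Bigl( \sum_{j \in J_0 \cup J_1} |\hat\lambda_j - \lambda_j|^2 \Bigr)^{1/2} \le 2 \beta_2(d(\lambda)) \sqrt{d(\lambda)}\, \|f_{\hat\lambda} - f_\lambda\|_{L_2(\Pi)}. \tag{2.12} \]

It remains to substitute bounds (2.9)-(2.12) into (2.8), to also use (2.2) and to solve the resulting inequality with respect to $\|f_{\hat\lambda} - f_\lambda\|_{L_2(\Pi)}$ to obtain the first bound of the theorem.

To prove the second bound, it is enough to use (2.9) to obtain a bound on $\|\hat\lambda - \lambda\|_{\ell_2}$. The remaining two bounds are proved the same way as in the proof of Theorem 1. □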
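The substitution step in this proof can be written out explicitly. The following is a sketch of the algebra only, with shorthand introduced here for readability: $y := \|f_{\hat\lambda} - f_\lambda\|_{L_2(\Pi)}$, $B := B(d(\lambda))$, $\beta_2 := \beta_2(d(\lambda))$, $d := d(\lambda)$, and $R$ for the right-hand side of (2.10); it assumes, as the definition of $\bar d$ suggests, that $B \beta_2 R \le 1/4$.

```latex
\begin{align*}
y \le B\,\|f_{\hat\lambda} - f_\lambda\|_{L_1(\Pi)}
  &\le B \Bigl( \sqrt{2\varepsilon \cdot 2\beta_2 \sqrt{d}\, y} + R \cdot 2\beta_2\, y \Bigr)
  && \text{by (2.2), (2.8)--(2.12)} \\
  &\le B \sqrt{4\varepsilon \beta_2 \sqrt{d}\, y} + \tfrac{1}{2}\, y
  && \text{since } 2 B \beta_2 R \le 1/2, \\
\text{hence}\quad y &\le 2B \sqrt{4\varepsilon \beta_2 \sqrt{d}\, y}
  \;\Longrightarrow\; y \le 16\, B^2 \beta_2 \sqrt{d}\, \varepsilon, \\
\|\hat\lambda - \lambda\|_{\ell_2} &\le 2\beta_2\, y \le 32\, B^2 \beta_2^2 \sqrt{d}\, \varepsilon
  && \text{by (2.9).}
\end{align*}
```

These are the constants 16 and 32 that appear in the statement of Theorem 2.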

Condition (2.2) required in Theorem 2 is rather restrictive. Moreover, since the $\psi_1$-norm of $f_u$ is involved in the definition of the set $\Lambda_{s,2}(A)$ of sparse vectors, one might need the equivalence of the $\psi_1$- and the $L_2$-norms on the linear span of the dictionary in order to have a more explicit way to describe the sparsity of the problem. Condition (2.2) is not needed in Theorem 1. However, this condition is needed to bound the quantity $\beta(J)$ in terms of the quantity $\beta_2(J)$, the latter being much more convenient because of its simple relationships to various geometric characteristics of the dictionary (see Section 3). So, in both cases, one must rely on condition (2.2) and the class of examples to which the results apply is rather limited (such as Gaussian and Rademacher dictionaries). Below, we give another version of sparsity oracle inequalities for the Dantzig selector that does not have this drawback and which applies to a variety of dictionaries. However, in this case, much more is required in terms of sparsity. In the orthonormal case, the result applies only to oracle vectors $\lambda$ with

\[ d(\lambda) \le c \sqrt{\frac{n}{\log N}}. \]

A similar constraint was needed, for instance, in sparsity oracle inequalities for LASSO in the paper by Bunea, Tsybakov and Wegkamp (2007). In Theorems 1 and 2, the oracle sets were larger, including vectors $\lambda$ with $d(\lambda)$ comparable to $n$.
The set of "sparse" vectors is defined as

\[ \Lambda_{s,3}(A) := \Bigl\{ \lambda \in \mathbb{R}^N : C \beta_2^2(J_\lambda) d(\lambda) \max_{1 \le k \le N} \|h_k\|_{\psi_2}^2 \sqrt{\frac{A \log N}{n}} \le 1/8 \Bigr\} \]

and the oracle set becomes

\[ \Lambda_\varepsilon^3(A) := \Lambda_\varepsilon(A) \cap \Lambda_{s,3}(A). \]

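Theorem 3. There exists a constant $C$ in the definitions of $\Lambda_\varepsilon(A)$ and $\Lambda_{s,3}(A)$ such that for all $A \ge 1$ with probability at least $1 - N^{-A}$, the following bounds hold: for all $\lambda \in \hat\Lambda_\varepsilon \cap \Lambda_{s,3}(A)$,

\[ \|f_{\hat\lambda} - f_\lambda\|_{L_2(\Pi)} \le 8 \beta_2(J_\lambda) \sqrt{d(\lambda)}\, \varepsilon, \]

\[ \|\hat\lambda - \lambda\|_{\ell_1} \le 16 \beta_2^2(J_\lambda) d(\lambda) \varepsilon \]

and

\[ \|\hat\lambda - \lambda\|_{\ell_2} \le 16 \beta_2^2(d(\lambda)) \sqrt{d(\lambda)}\, \varepsilon. \]

Suppose that $\Lambda_\varepsilon^3(A) \ne \emptyset$. Then, with probability at least $1 - N^{-A}$,

\[ \|f_{\hat\lambda} - f_*\|_{L_2(\Pi)} \le \inf_{\lambda \in \Lambda_\varepsilon^3(A)} \bigl[ \|f_\lambda - f_*\|_{L_2(\Pi)} + 8 \beta_2(J_\lambda) \sqrt{d(\lambda)}\, \varepsilon \bigr]. \]

Moreover, if, in addition, $f_* = f_{\lambda^*}$, $\lambda^* \in \mathbb{R}^N$, then, also with the same probability,

\[ \|\hat\lambda - \lambda^*\|_{\ell_1} \le \inf_{\lambda \in \Lambda_\varepsilon^3(A)} \bigl[ \|\lambda - \lambda^*\|_{\ell_1} + 16 \beta_2^2(J_\lambda) d(\lambda) \varepsilon \bigr] \]

and

\[ \|\hat\lambda - \lambda^*\|_{\ell_2} \le \inf_{\lambda \in \Lambda_\varepsilon^3(A)} \bigl[ \|\lambda - \lambda^*\|_{\ell_2} + 16 \beta_2^2(d(\lambda)) \sqrt{d(\lambda)}\, \varepsilon \bigr]. \]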

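Proof. This is similar to the proofs of Theorems 1 and 2. The following bounds are used for all $\lambda \in \hat\Lambda_\varepsilon \cap \Lambda_{s,3}(A)$:

\[ \|\hat\lambda - \lambda\|_{\ell_1} \le 2 \beta_2(J_\lambda) \sqrt{d(\lambda)}\, \|f_{\hat\lambda} - f_\lambda\|_{L_2(\Pi)}, \]

\[ \|\hat\lambda - \lambda\|_{\ell_2} \le 2 \beta_2(d(\lambda)) \|f_{\hat\lambda} - f_\lambda\|_{L_2(\Pi)}, \]

and

\[
\|f_{\hat\lambda} - f_\lambda\|_{L_2(\Pi)}^2 \le \|f_{\hat\lambda} - f_\lambda\|_{L_2(\Pi_n)}^2 + (\Pi - \Pi_n)(|f_{\hat\lambda} - f_\lambda|^2)
\le 2\varepsilon \|\hat\lambda - \lambda\|_{\ell_1} + \sup_{\|u\|_{\ell_1} \le 1} |(\Pi_n - \Pi)(|f_u|^2)| \, \|\hat\lambda - \lambda\|_{\ell_1}^2,
\]

and then Lemma 5 is applied to bound the last term on the right-hand side. □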
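The way these three bounds combine can be sketched as follows. The shorthand is introduced here only for the illustration: $y := \|f_{\hat\lambda} - f_\lambda\|_{L_2(\Pi)}$, $\beta_2 := \beta_2(J_\lambda)$, $d := d(\lambda)$ and $R := \sup_{\|u\|_{\ell_1} \le 1} |(\Pi_n - \Pi)(|f_u|^2)|$; the sketch assumes that Lemma 5 together with the definition of $\Lambda_{s,3}(A)$ gives $4 R \beta_2^2 d \le 1/2$.

```latex
\begin{align*}
y^2 &\le 2\varepsilon \cdot 2\beta_2 \sqrt{d}\, y + R \cdot 4 \beta_2^2 d\, y^2
     \le 4 \varepsilon \beta_2 \sqrt{d}\, y + \tfrac{1}{2}\, y^2, \\
\text{hence}\quad y &\le 8\, \beta_2 \sqrt{d}\, \varepsilon,
\qquad
\|\hat\lambda - \lambda\|_{\ell_1} \le 2 \beta_2 \sqrt{d}\, y \le 16\, \beta_2^2\, d\, \varepsilon .
\end{align*}
```

The $\ell_2$-bound of the theorem follows in the same way from $\|\hat\lambda - \lambda\|_{\ell_2} \le 2 \beta_2(d(\lambda))\, y$ and $\beta_2(J_\lambda) \le \beta_2(d(\lambda))$.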

It is not our goal in this paper to study the fixed design case in detail. However, some results are rather easy to obtain, in a manner similar to our derivations in the random design case (actually, with some simplifications). In particular, the following result holds. We will use a version of $\beta_2(J)$ with $\Pi$ replaced by the empirical measure $\Pi_n$ (based on the design points):

\[ \hat\beta_2(J) := \beta_2(J; \Pi_n) \quad \text{and} \quad \hat\beta_2(d) := \beta_2(d; \Pi_n). \]

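Theorem 4. Suppose $X_1, \dots, X_n$ are non-random design points in $S$ and let $\Pi_n$ be the empirical measure based on $X_1, \dots, X_n$. Suppose, also, that $f_* = f_{\lambda^*}$, $\lambda^* \in \mathbb{R}^N$. There exists a constant $C > 0$ such that, for all $A \ge 1$ and all

\[ \varepsilon \ge C \|\xi\|_{\psi_2} \max_{1 \le k \le N} \|h_k\|_{L_2(\Pi_n)} \sqrt{\frac{A \log N}{n}}, \]

with probability at least $1 - N^{-A}$, the following bounds hold:

\[ \|f_{\hat\lambda} - f_{\lambda^*}\|_{L_2(\Pi_n)} \le 4 \hat\beta_2(J_{\lambda^*}) \sqrt{d(\lambda^*)}\, \varepsilon, \]

\[ \|\hat\lambda - \lambda^*\|_{\ell_1} \le 8 \hat\beta_2^2(J_{\lambda^*}) d(\lambda^*) \varepsilon \]

and

\[ \|\hat\lambda - \lambda^*\|_{\ell_2} \le 8 \hat\beta_2^2(d(\lambda^*)) \sqrt{d(\lambda^*)}\, \varepsilon. \]

Proof. This is, essentially, a simplified version of the arguments used in the proofs of Theorems 1 and 2. The following two bounds are obtained exactly as in the proof of Theorem 1:

\[ \|f_{\hat\lambda} - f_{\lambda^*}\|_{L_2(\Pi_n)} \le \sqrt{2\varepsilon \|\hat\lambda - \lambda^*\|_{\ell_1}} \tag{2.13} \]

and

\[ \|\hat\lambda - \lambda^*\|_{\ell_1} \le 2 \hat\beta_2(J_{\lambda^*}) \sqrt{d(\lambda^*)}\, \|f_{\hat\lambda} - f_{\lambda^*}\|_{L_2(\Pi_n)}. \tag{2.14} \]

They hold if $\lambda^* \in \hat\Lambda_\varepsilon$, which is equivalent to the condition

\[ \max_{1 \le k \le N} \Bigl| n^{-1} \sum_{j=1}^n \xi_j h_k(X_j) \Bigr| \le \varepsilon. \]

If

\[ \varepsilon \ge C \|\xi\|_{\psi_2} \max_{1 \le k \le N} \|h_k\|_{L_2(\Pi_n)} \sqrt{\frac{A \log N}{n}}, \]

then standard exponential bounds for sums of independent random variables and bounds on the maximum of random variables in Orlicz spaces imply that with probability at least $1 - N^{-A}$, $\lambda^* \in \hat\Lambda_\varepsilon$.
It remains to combine (2.13) and (2.14) to prove that with probability at least $1 - N^{-A}$,

\[ \|f_{\hat\lambda} - f_{\lambda^*}\|_{L_2(\Pi_n)} \le 4 \hat\beta_2(J_{\lambda^*}) \sqrt{d(\lambda^*)}\, \varepsilon \]

and

\[ \|\hat\lambda - \lambda^*\|_{\ell_1} \le 8 \hat\beta_2^2(J_{\lambda^*}) d(\lambda^*) \varepsilon. \]

Arguing exactly as in the proof of Theorem 2 (in particular, using Lemma 1), one can add to this that

\[ \|\hat\lambda - \lambda^*\|_{\ell_2} \le 8 \hat\beta_2^2(d(\lambda^*)) \sqrt{d(\lambda^*)}\, \varepsilon. \]

□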

One can also obtain an upper bound on $\hat\beta_2(J)$, $d(J) = d$, in terms of fixed design versions of restricted isometry constants (see Lemma 2, Corollary 6), which leads to Theorem 1 in Candes and Tao (2007) (bounds on the performance of the Dantzig selector under the UUP). They also proved a sharper oracle inequality in the case of random design that we are not going to reproduce here. However, Corollary 3 in the next section provides a direct proof of a similar inequality in the random design case (for orthonormal dictionaries).

3. Corollaries and remarks 

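Under the additional assumption

\[ \|\lambda\|_{\ell_2} \le B \|f_\lambda\|_{L_1(\Pi)}, \qquad \lambda \in \mathbb{R}^N, \tag{3.1} \]

it is easy to establish a corollary of Theorem 1 that implies the main results of Candes and Tao (2007) in the random design case. Assume that $f_* = f_{\lambda^*}$, $\lambda^* \in \mathbb{R}^N$.
Define

\[ \bar\Lambda_\varepsilon(A) := \Lambda_\varepsilon(A) \cap \Bigl\{ \lambda \in \mathbb{R}^N : C B \max_{1 \le k \le N} \|h_k(X)\|_{\psi_1} \sqrt{\frac{A d(\lambda) \log N}{n}} \le 1/4 \Bigr\}. \]

Corollary 1. Suppose that condition (3.1) holds. There then exists a constant $C$ in the definition of the set $\bar\Lambda_\varepsilon(A)$ such that, for all $A \ge 1$ and under the assumption $\bar\Lambda_\varepsilon(A) \ne \emptyset$, with probability at least $1 - N^{-A}$,

\[ \|\hat\lambda - \lambda^*\|_{\ell_2} \le \inf_{\lambda \in \bar\Lambda_\varepsilon(A)} \bigl[ \|\lambda - \lambda^*\|_{\ell_2} + 16 B^2 \sqrt{d(\lambda)}\, \varepsilon \bigr]. \]

Proof. Under the assumption (3.1),

\[ \sum_{j \in J} |\lambda_j| \le \sqrt{d(J)}\, \|\lambda\|_{\ell_2} \le B \sqrt{d(J)}\, \|f_\lambda\|_{L_1(\Pi)}, \]

implying that $\beta(J) \le B \sqrt{d(J)}$. Therefore, $\bar\Lambda_\varepsilon(A) \subset \tilde\Lambda_\varepsilon(A)$. Denoting by $\bar\lambda$ the value of $\lambda$ that minimizes the right-hand side of the bound of Corollary 1, the bound

\[ \|f_{\hat\lambda} - f_{\bar\lambda}\|_{L_1(\Pi)} \le 16 \beta(J_{\bar\lambda}) \varepsilon \le 16 B \sqrt{d(\bar\lambda)}\, \varepsilon \]

follows from the first inequality of Theorem 1. This yields

\[ \|\hat\lambda - \bar\lambda\|_{\ell_2} \le 16 B^2 \sqrt{d(\bar\lambda)}\, \varepsilon, \]

implying the result. □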
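If

\[ \varepsilon \ge C \max_{1 \le k \le N} \|\xi h_k(X)\|_{\psi_1} \sqrt{\frac{A \log N}{n}} \tag{3.2} \]

and the vector $\lambda^*$ is sufficiently sparse in the sense that

\[ C B \max_{1 \le k \le N} \|h_k(X)\|_{\psi_1} \sqrt{\frac{A d(\lambda^*) \log N}{n}} \le 1/4, \tag{3.3} \]

then Corollary 1 immediately implies that with probability at least $1 - N^{-A}$,

\[ \|\hat\lambda - \lambda^*\|_{\ell_2} \le 16 B^2 \sqrt{d(\lambda^*)}\, \varepsilon. \]

It is enough to observe that, in this case, $\lambda^* \in \bar\Lambda_\varepsilon(A)$ and to use $\lambda = \lambda^*$ in the bound of the corollary (without taking the infimum). By simple properties of Orlicz norms,

\[ \|\xi h_k(X)\|_{\psi_1} \le \|\xi\|_{\psi_2} \|h_k(X)\|_{\psi_2}. \tag{3.4} \]

(Indeed, for random variables $\eta_1, \eta_2$ such that $\|\eta_i\|_{\psi_2} \le 1$, $i = 1, 2$, the following holds by the definitions of the norms:

\[ \|\eta_1 \eta_2\|_{\psi_1} \le \|(\eta_1^2 + \eta_2^2)/2\|_{\psi_1} \le (\|\eta_1^2\|_{\psi_1} + \|\eta_2^2\|_{\psi_1})/2 \le (\|\eta_1\|_{\psi_2} + \|\eta_2\|_{\psi_2})/2 \le 1. \]

This immediately implies that for all $\eta_1, \eta_2$,

\[ \|\eta_1 \eta_2\|_{\psi_1} \le \|\eta_1\|_{\psi_2} \|\eta_2\|_{\psi_2}.) \]

If $\xi$ is a normal random variable with mean zero and variance $\sigma^2$, the $\psi_2$-norm of $\xi$ coincides with $\sigma$ (up to a numerical constant). So, under the assumption that

\[ \|h_k(X)\|_{\psi_2} \le 1, \qquad k = 1, \dots, N, \]

conditions (3.2) and (3.3) take the following form:

\[ \varepsilon \ge C \sigma \sqrt{\frac{A \log N}{n}} \tag{3.5} \]

and

\[ C B \sqrt{\frac{A d(\lambda^*) \log N}{n}} \le 1/4. \tag{3.6} \]

The case $\sigma = 0$ (no noise in the regression model) is of special interest. In this case, $\|\xi h_k(X)\|_{\psi_1} = 0$ and one can use $\varepsilon = 0$ in the definition of the Dantzig selector. The following result holds.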

Corollary 2. Suppose that $\xi = 0$ and $\|h_k(X)\|_{\psi_1} \le 1$. Let $\varepsilon = 0$. If condition (3.1) and sparsity condition (3.6) hold, then, with probability at least $1 - N^{-A}$, $\hat\lambda = \lambda^*$.

Moreover, if we assume that both $\psi_1$- and $L_1$-norms on the linear span of the dictionary are equivalent, up to numerical constants (independent of $N$), to the $\ell_2$-norm in the space of vectors of coefficients (which is true, for instance, for Gaussian and Rademacher dictionaries), then the sparsity assumption (3.6) can be replaced by a slightly weaker assumption $d(\lambda^*) \le \bar d$, where $\bar d$ satisfies the condition

\[ \frac{A \bar d \log(N/\bar d)}{n} \le c \tag{3.7} \]

with a proper choice of $c$, and the conclusion of Corollary 2 still holds (when $N = n$, this means that $d(\lambda^*) \le cn$, with a proper choice of constant $c$). Theorem 2 must be used, leading to the bound

\[ \mathbb{P}\{\hat\lambda \ne \lambda^*\} \le 5^{-\bar d A} \binom{N}{\le \bar d}^{-A}, \]

so, in this case, the probability of the error is bounded in a better way.
Hence, the Dantzig selector provides an exact solution to the problem of recovery of a sparse vector $\lambda^*$ based on noiseless measurements of function $f_{\lambda^*}$ at random points. It is easy to see that in this case, one can also use another definition of $\hat\lambda$, as a minimizer of the $\ell_1$-norm $\|\lambda\|_{\ell_1}$ subject to the linear constraints $f_\lambda(X_j) = Y_j$, $j = 1, \dots, n$, with no changes in the proof. This striking fact has been known for a while and has some interesting connections to deep results in convex geometry and asymptotic geometric analysis (such as, for instance, neighborliness of convex polytopes; see Donoho (2006a, 2006b), Candes and Tao (2005), Candes, Romberg and Tao (2006), Rudelson and Vershynin (2005), Mendelson, Pajor and Tomczak-Jaegermann (2007) and references therein).

We will consider another interesting consequence of Corollary 1 under the additional
assumptions that the dictionary is orthonormal in $L_2(\Pi)$ and that the $\psi_2$- and the
$L_2$-norms are equivalent on the linear span of the dictionary up to a numerical constant.
Because of orthonormality, the $L_2$-norm is equal to the $\ell_2$-norm in the space of
coefficients, and condition (3.1) becomes, in this case, a version of condition (2.2):

$$\|f_\lambda\|_{L_2(\Pi)} \le B\|f_\lambda\|_{L_1(\Pi)}, \qquad \lambda \in \mathbb{R}^N.$$

Thus, all of the Orlicz norms between $L_1$ and $\psi_2$ are equivalent in this case. In particular,
this applies to Gaussian and Rademacher dictionaries. The result given below also follows
from the oracle inequality proven by Candes and Tao (2007) in the fixed design case
(under the UUP condition). It is also not hard to establish it for dictionaries that
are not necessarily orthonormal, but that satisfy some assumption on the "weakness" of
correlations between the functions $h_j$.

Corollary 3. Under the above assumptions, including (3.1), the assumption that the
noise is $N(0, \sigma^2)$, that the dictionary is orthonormal in $L_2(\Pi)$ and that the $\psi_2$- and the
$L_2$-norms are equivalent on the linear span of the dictionary up to a numerical constant,
there exists a choice of constant $C$ such that for all $\varepsilon$ satisfying condition (3.5) and $\lambda^*$
satisfying the sparsity assumption (3.6), with probability at least $1 - N^{-A}$ and with some
$D > 0$ depending on $B$ in condition (3.1),

$$\|\hat\lambda - \lambda^*\|_{\ell_2}^2 \le D\sum_{j=1}^N \bigl(|\lambda_j^*|^2 \wedge \varepsilon^2\bigr).$$



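Proof. Define $\tilde\lambda^*$ as follows:

$$\tilde\lambda_j^* = \lambda_j^* I(|\lambda_j^*| \ge \varepsilon/3).$$

We then have

$$\|\xi h_k(X)\|_{\psi_1} \le c\|\xi\|_{\psi_2}\|h_k(X)\|_{\psi_2} \le c\sigma,$$

with a numerical constant $c$, implying that with a proper choice of $C$, $C'$,

$$C\|\xi h_k(X)\|_{\psi_1}\sqrt{\frac{A\log N}{n}} \le \varepsilon/3,$$

provided that

$$\varepsilon \ge C'\sigma\sqrt{\frac{A\log N}{n}}.$$

Finally,

$$\|(f_{\tilde\lambda^*} - f_{\lambda^*})(X)h_k(X)\|_{\psi_1} \le c\|(f_{\tilde\lambda^*} - f_{\lambda^*})(X)\|_{\psi_2}\|h_k(X)\|_{\psi_2} \le c\|f_{\tilde\lambda^*} - f_{\lambda^*}\|_{L_2(\Pi)}
= c\Bigl(\sum_{j:\,|\lambda_j^*| < \varepsilon/3}|\lambda_j^*|^2\Bigr)^{1/2} \le c(\varepsilon/3)\sqrt{d(\lambda^*)},$$

which implies that

$$C\|(f_{\tilde\lambda^*} - f_{\lambda^*})(X)h_k(X)\|_{\psi_1}\sqrt{\frac{A\log N}{n}} \le \varepsilon/3,$$

provided that

$$c\sqrt{\frac{A\,d(\lambda^*)\log N}{n}} \le 1.$$

The last condition is equivalent to (3.6) with a proper choice of constant therein.
Hence, $\tilde\lambda^* \in \Lambda_\varepsilon(A)$ and Corollary 1 implies that with probability at least $1 - N^{-A}$,

$$\|\hat\lambda - \lambda^*\|_{\ell_2} \le \Bigl(\sum_{j:\,|\lambda_j^*| < \varepsilon/3}|\lambda_j^*|^2\Bigr)^{1/2} + 16B^2\sqrt{\mathrm{card}(j : |\lambda_j^*| \ge \varepsilon/3)}\,\varepsilon,$$

which yields that, with some constant $D$ (depending on $B$),

$$\|\hat\lambda - \lambda^*\|_{\ell_2}^2 \le D\sum_{j=1}^N \bigl(|\lambda_j^*|^2 \wedge \varepsilon^2\bigr). \qquad \Box$$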
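The passage from the two-term bound to the final sum can be spelled out. The following is a sketch only: $K$ is a placeholder for the constant multiplying the second term of the two-term bound, and the only tools used are $(a+b)^2 \le 2a^2 + 2b^2$ and the observation that $|\lambda_j^*|^2 \wedge \varepsilon^2 \ge \varepsilon^2/9$ whenever $|\lambda_j^*| \ge \varepsilon/3$.

```latex
\begin{align*}
\|\hat\lambda-\lambda^*\|_{\ell_2}^2
 &\le 2\sum_{j:\,|\lambda_j^*|<\varepsilon/3}|\lambda_j^*|^2
    + 2K^2\,\mathrm{card}\{j:|\lambda_j^*|\ge\varepsilon/3\}\,\varepsilon^2 \\
 &\le 2\sum_{j:\,|\lambda_j^*|<\varepsilon/3}\bigl(|\lambda_j^*|^2\wedge\varepsilon^2\bigr)
    + 18K^2\sum_{j:\,|\lambda_j^*|\ge\varepsilon/3}\bigl(|\lambda_j^*|^2\wedge\varepsilon^2\bigr) \\
 &\le \max(2,\,18K^2)\sum_{j=1}^N\bigl(|\lambda_j^*|^2\wedge\varepsilon^2\bigr).
\end{align*}
```

So one admissible (not optimized) choice is $D = \max(2, 18K^2)$.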

We now describe a couple of ways of bounding the quantity $\beta_2(J)$ involved in Theorem
2 and in the upper bound on $\beta(J)$.

Let $\kappa(J)$ denote the minimal eigenvalue of the Gram matrix $(\langle h_i, h_j\rangle_{L_2(\Pi)})_{i,j \in J}$. Also,
denote by $L_J$ the linear span of $\{h_j : j \in J\}$ and let

$$\rho(J) := \sup_{f \in L_J,\ g \in L_{J^c},\ f, g \ne 0} \frac{|\langle f, g\rangle_{L_2(\Pi)}|}{\|f\|_{L_2(\Pi)}\|g\|_{L_2(\Pi)}}.$$
$\rho(J)$ is thus the largest "correlation coefficient" (or the largest cosine of the angle)
between functions in the linear span of a subset $\{h_j : j \in J\}$ of the dictionary and the linear
span of its complement. It is of interest to compare $\rho(J)$ with the concept of canonical
correlation used in multivariate statistical analysis. It is very easy to check (see
Koltchinskii (2009), Proposition 1) that

$$\beta_2(J) \le \frac{1}{\sqrt{\kappa(J)(1 - \rho^2(J))}}$$

and, as a consequence, if (2.2) holds with some constant $B(J) > 0$, then

$$\beta(J) \le \sqrt{\bar d(J)},$$

where

$$\bar d(J) := \frac{B^2(J)\,d(J)}{\kappa(J)(1 - \rho^2(J))}.$$

In particular, if the dictionary $\{h_1, \dots, h_N\}$ is orthonormal in $L_2(\Pi)$ and condition (2.2)
holds with a constant $B$ that does not depend on $J$ (for instance, in the case of Gaussian
or Bernoulli dictionaries), then $\kappa(J) = 1$ and $\rho(J) = 0$, so $\bar d(J) = B^2 d(J)$, leading to the
bound

$$\beta(J) \le B\sqrt{d(J)}.$$

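We define

$$\bar d(\lambda) := \bar d(J_\lambda),$$

which plays the role of a modified "dimension" of the vector $\lambda$ (that takes into account
how close the dictionary is to the orthonormality property; in the orthonormal case,
$\bar d(\lambda) = B^2 d(\lambda)$).
Define

$$\bar\Lambda_\varepsilon(A) := \Lambda_\varepsilon(A) \cap \Bigl\{\lambda \in \mathbb{R}^N : C\max_{1 \le k \le N}\|h_k(X)\|_{\psi_1}\sqrt{\frac{A\,\bar d(\lambda)\log N}{n}} \le 1/4\Bigr\}.$$

The proof of the following corollary repeats the proof of Corollary 1, with the $\ell_2$-norm
replaced by the $L_2(\Pi)$-norm.

Corollary 4. Suppose that condition (2.2) holds. There exists a constant $C$ in the
definition of $\bar\Lambda_\varepsilon(A)$ such that for all $A \ge 1$, the assumption $\bar\Lambda_\varepsilon(A) \ne \emptyset$ implies that with
probability at least $1 - N^{-A}$,

$$\|f_{\hat\lambda} - f_*\|_{L_2(\Pi)} \le \inf_{\lambda \in \bar\Lambda_\varepsilon(A)}\Bigl[\|f_\lambda - f_*\|_{L_2(\Pi)} + 16\sqrt{\bar d(\lambda)}\,\varepsilon\Bigr].$$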
Denote by $f_{\lambda^\circ}$, $\lambda^\circ \in \mathbb{R}^N$, the orthogonal projection in $L_2(\Pi)$ of $f_*$ onto the linear span
of the dictionary. The following result is also straightforward.

Corollary 5. Suppose that condition (2.2) holds and that the noise is normal with
mean zero and variance $\sigma^2$. There then exists a constant $C$ such that for all $A \ge 1$ and
all

$$\varepsilon \ge C\bigl(\|f_{\lambda^\circ} - f_*\|_{L_\infty} + \sigma\bigr)\sqrt{\frac{A\log N}{n}},$$

if $\lambda^\circ$ satisfies the "sparsity" condition

$$C\sqrt{\frac{A\,\bar d(\lambda^\circ)\log N}{n}} \le 1/4,$$

then with probability at least $1 - N^{-A}$,

$$\|f_{\hat\lambda} - f_*\|_{L_2(\Pi)}^2 \le \|f_{\lambda^\circ} - f_*\|_{L_2(\Pi)}^2 + 16^2\,\bar d(\lambda^\circ)\,\varepsilon^2.$$

Proof. Under the assumptions, it is easy to check that $\lambda^\circ \in \bar\Lambda_\varepsilon(A)$. This allows one to
deduce that, with probability at least $1 - N^{-A}$,

$$\|f_{\hat\lambda} - f_{\lambda^\circ}\|_{L_2(\Pi)}^2 \le 16^2\,\bar d(\lambda^\circ)\,\varepsilon^2.$$

Since $f_{\hat\lambda} - f_{\lambda^\circ}$ and $f_{\lambda^\circ} - f_*$ are orthogonal, this implies the result. □

Another approach to bounding $\beta_2(J)$ is based on some quantities involved in the
restricted isometry condition.

For $I, J \subset \{1, \dots, N\}$, $I \cap J = \emptyset$, define

$$r(I; J) := \sup_{f \in L_I,\ g \in L_J,\ f, g \ne 0} \frac{|\langle f, g\rangle_{L_2(\Pi)}|}{\|f\|_{L_2(\Pi)}\|g\|_{L_2(\Pi)}}$$

(for $\rho(J)$ defined before, $\rho(J) = r(J, J^c)$). Let

$$\rho_d := \max\{r(I, J) : I, J \subset \{1, \dots, N\},\ I \cap J = \emptyset,\ d(I) = 2d,\ d(J) = d\}.$$

This quantity measures the correlation between linear spans of disjoint parts of the
dictionary of fixed cardinalities, $d$ and $2d$ (it is a more "local" characteristic of the
dictionary than the quantity $\rho(J)$ used before).
We will also define

$$m_d := \inf\{\|f_u\|_{L_2(\Pi)} : u \in \mathbb{R}^N,\ \|u\|_{\ell_2} = 1,\ d(u) \le d\}$$

and

$$M_d := \sup\{\|f_u\|_{L_2(\Pi)} : u \in \mathbb{R}^N,\ \|u\|_{\ell_2} = 1,\ d(u) \le d\}.$$
$$\rho_d < \frac{m_{2d}}{M_{2d}}.$$

Then,

$$\beta_2(J) \le \frac{1}{m_{2d} - \rho_d M_{2d}}.$$



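Proof. Recall Lemma 1 and its notation. Denote by $P_I$ the orthogonal projection on
$L_I \subset L_2(\Pi)$. Then, for all $u \in C_J$,

$$\begin{aligned}
\Bigl\|\sum_{j=1}^N u_j h_j\Bigr\|_{L_2(\Pi)}
&\ge \Bigl\|P_I\sum_{j=1}^N u_j h_j\Bigr\|_{L_2(\Pi)} \\
&\ge \Bigl\|\sum_{j \in J_0 \cup J_1} u_j h_j\Bigr\|_{L_2(\Pi)} - \sum_{k \ge 2}\Bigl\|P_I\sum_{j \in J_k} u_j h_j\Bigr\|_{L_2(\Pi)} \\
&\ge \Bigl\|\sum_{j \in J_0 \cup J_1} u_j h_j\Bigr\|_{L_2(\Pi)} - \rho_d\sum_{k \ge 2}\Bigl\|\sum_{j \in J_k} u_j h_j\Bigr\|_{L_2(\Pi)} \\
&\ge \Bigl\|\sum_{j \in J_0 \cup J_1} u_j h_j\Bigr\|_{L_2(\Pi)} - \rho_d M_{2d}\sum_{k \ge 2}\|u^{(k)}\|_{\ell_2} \\
&\ge \Bigl\|\sum_{j \in J_0 \cup J_1} u_j h_j\Bigr\|_{L_2(\Pi)} - \rho_d M_{2d}\Bigl(\sum_{j \in J_0 \cup J_1}|u_j|^2\Bigr)^{1/2} \\
&\ge \Bigl\|\sum_{j \in J_0 \cup J_1} u_j h_j\Bigr\|_{L_2(\Pi)} - \rho_d\frac{M_{2d}}{m_{2d}}\Bigl\|\sum_{j \in J_0 \cup J_1} u_j h_j\Bigr\|_{L_2(\Pi)} \\
&= \Bigl(1 - \rho_d\frac{M_{2d}}{m_{2d}}\Bigr)\Bigl\|\sum_{j \in J_0 \cup J_1} u_j h_j\Bigr\|_{L_2(\Pi)}.
\end{aligned}$$

On the other hand, for $u \in C_J$,

$$\Bigl(\sum_{j \in J}|u_j|^2\Bigr)^{1/2} \le \Bigl(\sum_{j \in J_0 \cup J_1}|u_j|^2\Bigr)^{1/2} \le \frac{1}{m_{2d}}\Bigl\|\sum_{j \in J_0 \cup J_1} u_j h_j\Bigr\|_{L_2(\Pi)},$$

implying that

$$\Bigl(\sum_{j \in J}|u_j|^2\Bigr)^{1/2} \le \frac{1}{m_{2d}}\Bigl(1 - \rho_d\frac{M_{2d}}{m_{2d}}\Bigr)^{-1}\Bigl\|\sum_{j=1}^N u_j h_j\Bigr\|_{L_2(\Pi)},$$

which yields

$$\beta_2(J) \le \frac{1}{m_{2d} - \rho_d M_{2d}}. \qquad \Box$$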

If $m_d \le 1 \le M_d \le 2$, one can express the restricted isometry constant $\delta_d = \delta_d(h)$ as

$$\delta_d = (M_d - 1) \vee (1 - m_d).$$

It is also easy to show that



Pd 



< 



1 



Kid 



1 



- 1 



1 - 



1 



':»„/ 



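Lemma 2 then implies that there exists $\delta < 1$ such that, under the condition $\delta_{3d} \le \delta$,
$\beta_2(J) \le c$ for all sets $J$ with $d(J) \le d$, where $c$ is a constant that depends only on $\delta$ (for
instance, one can take $\delta = 1/8$).

Denote by $\bar d$ the largest $d$ such that $d \le N/e - 1$, $\delta_{3d} \le \delta$,

$$\frac{A\,d\log(N/d)}{n} \le 1$$

and

$$C B(d)\sup_{\|u\|_{\ell_2} \le 1,\ d(u) \le d}\|f_u\|_{\psi_1}\sqrt{\frac{A\,d\log(N/d)}{n}} \le 1/4.$$

Let

$$\tilde\Lambda_\varepsilon(A) := \Lambda_\varepsilon(A) \cap \{\lambda \in \mathbb{R}^N : d(\lambda) \le \bar d\}.$$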

The next corollary is an immediate consequence of Theorem 2 and Lemma 2. It
shows that sparse enough target functions can be recovered by the Dantzig selector
under a version of the restricted isometry assumption. In particular, Proposition 1 in the
Introduction immediately follows from this corollary.

Corollary 6. There exist constants $C$, $D$ depending only on $\delta$ such that for all $A \ge 1$,
the assumption $\tilde\Lambda_\varepsilon(A) \ne \emptyset$ implies that with probability at least $1 - N^{-A}$,

$$\|f_{\hat\lambda} - f_*\|_{L_2(\Pi)} \le \inf_{\lambda \in \tilde\Lambda_\varepsilon(A)}\Bigl[\|f_\lambda - f_*\|_{L_2(\Pi)} + D B^2(d(\lambda))\sqrt{d(\lambda)}\,\varepsilon\Bigr].$$

Moreover, if, in addition, $f_* = f_{\lambda^*}$, $\lambda^* \in \mathbb{R}^N$, then we also have

$$\|\hat\lambda - \lambda^*\|_{\ell_2} \le \inf_{\lambda \in \tilde\Lambda_\varepsilon(A)}\Bigl[\|\lambda - \lambda^*\|_{\ell_2} + D B^2(d(\lambda))\sqrt{d(\lambda)}\,\varepsilon\Bigr].$$

Another way to bound the quantity $\beta_2(J)$ is given in the following proposition that is
akin to some statements in Bickel, Ritov and Tsybakov (2009) (in the fixed design case).
The proof is rather straightforward and is based on a simple modification of Lemma 1.

Proposition 3. If $J \subset \{1, \dots, N\}$ with $d(J) = d$ and, for some $s \ge 1$,

$$\frac{M_s}{m_{s+d}} < \sqrt{\frac{s}{d}},$$

then

$$\beta_2(J) \le \frac{\sqrt{s}}{\sqrt{s}\,m_{d+s} - \sqrt{d}\,M_s}.$$



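We conclude this section with a couple of examples that provide some explanation of
the role of such geometric characteristics of the dictionary as $\beta_2(J)$ in sparse recovery
problems.

Example. Consider the case where the functions $h_1, \dots, h_N$ are orthogonal in $L_2(\Pi)$. It
is easy to see that

$$\beta_2(J) = \frac{1}{\min_{j \in J}\|h_j\|_{L_2(\Pi)}}.$$

Suppose that $f_* = f_{\lambda^*}$ with $\lambda^* \in \mathbb{R}^N$ and

$$\|h_j\|_{L_2(\Pi)} = \tau > 0, \qquad j \in J_{\lambda^*}.$$

Fix the value of the parameter $\varepsilon > 0$ of the Dantzig selector and consider, for simplicity,
the limit case when $n \to \infty$. In this limit, the set $\Lambda_\varepsilon$ becomes

$$\Lambda_\varepsilon := \{\lambda \in \mathbb{R}^N : |\langle f_\lambda - f_{\lambda^*}, h_k\rangle| \le \varepsilon,\ k = 1, \dots, N\},$$

which, under the orthogonality assumption, is just

$$\Lambda_\varepsilon := \{\lambda \in \mathbb{R}^N : |\lambda_k - \lambda_k^*|\,\|h_k\|_{L_2(\Pi)}^2 \le \varepsilon,\ k = 1, \dots, N\}.$$

It is easy to compute the Dantzig selector: for $k = 1, \dots, N$,

$$\hat\lambda_k = \Bigl(\lambda_k^* - \frac{\varepsilon}{\|h_k\|_{L_2(\Pi)}^2}\Bigr) I\Bigl(\lambda_k^* \ge \frac{\varepsilon}{\|h_k\|_{L_2(\Pi)}^2}\Bigr) + \Bigl(\lambda_k^* + \frac{\varepsilon}{\|h_k\|_{L_2(\Pi)}^2}\Bigr) I\Bigl(\lambda_k^* \le -\frac{\varepsilon}{\|h_k\|_{L_2(\Pi)}^2}\Bigr).$$

Therefore,

$$\|\hat\lambda - \lambda^*\|_{\ell_2}^2 = \varepsilon^2\sum_{k:\,|\lambda_k^*| \ge \varepsilon/\|h_k\|_{L_2(\Pi)}^2}\frac{1}{\|h_k\|_{L_2(\Pi)}^4} + \sum_{k:\,|\lambda_k^*| < \varepsilon/\|h_k\|_{L_2(\Pi)}^2}|\lambda_k^*|^2.$$

If $|\lambda_j^*| \ge \varepsilon/\tau^2$ for all $j \in J_{\lambda^*}$, we get

$$\|\hat\lambda - \lambda^*\|_{\ell_2} = \frac{1}{\tau^2}\sqrt{d(\lambda^*)}\,\varepsilon.$$

If $h_j(X)$, $j = 1, \dots, N$, are i.i.d. $N(0, \tau^2)$, this yields

$$\|\hat\lambda - \lambda^*\|_{\ell_2} = \beta_2^2(d(\lambda^*))\sqrt{d(\lambda^*)}\,\varepsilon,$$

which is in agreement with the last bound of Theorem 2 in this case. Thus, the presence
of $\beta_2(d)$ in the bound has something to do with the nature of the problem, although there
might be different and, possibly, much better ways to take into account the geometry of
the dictionary.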
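The limiting computation in this example is coordinate-wise soft thresholding, which is easy to check numerically. The sketch below is illustrative only: the function names and the numerical values of $\lambda^*$, $\|h_k\|_{L_2(\Pi)}^2$ and $\varepsilon$ are hypothetical, and it assumes the orthogonal, $n \to \infty$ setting in which the constraint set is the box $|\lambda_k - \lambda_k^*|\,\|h_k\|^2 \le \varepsilon$.

```python
import math

def dantzig_orthogonal_limit(lam_star, norms_sq, eps):
    """Minimal-l1 point of the box |lam_k - lam_star_k| * norms_sq[k] <= eps."""
    out = []
    for lam, h2 in zip(lam_star, norms_sq):
        t = eps / h2  # per-coordinate threshold eps / ||h_k||^2
        if lam >= t:
            out.append(lam - t)   # shrink a large positive coefficient
        elif lam <= -t:
            out.append(lam + t)   # shrink a large negative coefficient
        else:
            out.append(0.0)       # zero is feasible, so it minimizes |lam_k|
    return out

def squared_error_formula(lam_star, norms_sq, eps):
    """Closed-form squared l2 error: eps^2 / ||h_k||^4 or |lam_star_k|^2 per coordinate."""
    total = 0.0
    for lam, h2 in zip(lam_star, norms_sq):
        t = eps / h2
        total += t * t if abs(lam) >= t else lam * lam
    return total

# Hypothetical inputs, chosen only for illustration.
lam_star = [3.0, -2.0, 0.05, 0.0, 1.5]
norms_sq = [1.0, 4.0, 1.0, 2.0, 0.25]
eps = 0.2

lam_hat = dantzig_orthogonal_limit(lam_star, norms_sq, eps)
direct = sum((a - b) ** 2 for a, b in zip(lam_hat, lam_star))
closed_form = squared_error_formula(lam_star, norms_sq, eps)
```

Here `direct` and `closed_form` agree, and when all active coefficients exceed $\varepsilon/\tau^2$ with a common norm $\tau$, the error reduces to $\sqrt{d(\lambda^*)}\,\varepsilon/\tau^2$.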

Example. Suppose that

$$\{1, \dots, N\} := \bigcup_{k=1}^m I_k,$$

where $I_k$ are disjoint sets. Suppose that $\phi_1(X), \dots, \phi_m(X)$ are i.i.d. $N(0, 1)$. Let

$$g_\mu = \sum_{k=1}^m \mu_k\phi_k, \qquad \mu = (\mu_1, \dots, \mu_m) \in \mathbb{R}^m,$$

and define $h_j := \phi_k$, $j \in I_k$. It is easy to check that, for such a dictionary, the Dantzig
selector $\hat\lambda$ is a solution of the following problem. First, solve the problem

$$\sum_{j=1}^m |\mu_j| \to \min$$

subject to the constraints

$$\max_{1 \le k \le m}\Bigl|n^{-1}\sum_{j=1}^n (g_\mu(X_j) - Y_j)\phi_k(X_j)\Bigr| \le \varepsilon.$$

Let $\hat\mu$ be its solution. Then, take arbitrary $\hat\lambda$ satisfying the conditions

$$\sum_{j \in I_k}\hat\lambda_j = \hat\mu_k, \qquad \mathrm{sign}(\hat\lambda_j) = \mathrm{sign}(\hat\mu_k), \quad j \in I_k,\ k = 1, \dots, m.$$

Clearly, we have $f_{\hat\lambda} = g_{\hat\mu}$ and

$$\|f_{\hat\lambda} - f_*\|_{L_2(\Pi)}^2 = \|g_{\hat\mu} - f_*\|_{L_2(\Pi)}^2.$$

If $f_* = g_{\mu^*}$ for some $\mu^* \in \mathbb{R}^m$, then it follows from Theorem 2 that, with a high probability,

$$\|f_{\hat\lambda} - f_*\|_{L_2(\Pi)}^2 = \|g_{\hat\mu} - g_{\mu^*}\|_{L_2(\Pi)}^2 \le C d(\mu^*)\varepsilon^2$$

(under appropriate further assumptions, say, that $\xi$ is $N(0, \sigma^2)$ and

$$\varepsilon \ge C\sigma\sqrt{\frac{A\log m}{n}}\ \Bigr).$$

Of course, in this case, the dictionary is linearly dependent, coefficients of representation
$f_\lambda$ are not identifiable and, in addition, such quantities as $\beta_2(J)$ are infinite. But the
Dantzig selector is still recovering $f_*$ with the $L_2(\Pi)$-error being of the correct size.
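The second step of this construction, spreading each reduced coefficient over its group of identical functions, can be illustrated directly. The sketch below is a toy under stated assumptions: the partition, the reduced solution and the weights are all hypothetical, and `spread_over_groups` is an illustrative helper, not part of any library. It shows that any same-sign split preserves both the group sums (hence the function $f_\lambda$) and the $\ell_1$-norm.

```python
def spread_over_groups(mu, groups, weights):
    """Build lambda with sum_{j in I_k} lambda_j = mu_k and sign(lambda_j) = sign(mu_k)."""
    n_coef = sum(len(g) for g in groups)
    lam = [0.0] * n_coef
    for mu_k, idx, w in zip(mu, groups, weights):
        total = sum(w)
        for j, w_j in zip(idx, w):
            # Positive weights summing to `total` keep the sign of mu_k.
            lam[j] = mu_k * w_j / total
    return lam

groups = [[0, 1, 2], [3], [4, 5]]               # hypothetical partition I_1, I_2, I_3
mu_hat = [1.5, -2.0, 0.0]                       # hypothetical reduced solution
weights = [[1.0, 2.0, 3.0], [1.0], [3.0, 1.0]]  # arbitrary positive splits

lam_hat = spread_over_groups(mu_hat, groups, weights)
group_sums = [sum(lam_hat[j] for j in idx) for idx in groups]
l1_lam = sum(abs(x) for x in lam_hat)
l1_mu = sum(abs(x) for x in mu_hat)
```

Changing `weights` produces a different $\hat\lambda$ with the same `group_sums` and the same $\ell_1$-norm, which is exactly the non-identifiability noted in the example.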

This example shows that there are situations beyond the scope of Theorems 1 and 2
in which the Dantzig selector is a reasonably good estimator of an unknown regression
function $f_*$. It might be possible to develop more subtle geometric characteristics of the
dictionary than $\beta_2$ which can be used, for instance, when the dictionary can be partitioned
into disjoint sets of highly correlated functions with very little correlation between the
sets, and to prove sparsity oracle inequalities in terms of such characteristics. However,
it is not our goal in this paper to study such situations in detail.



Appendix: Several exponential bounds 



We need the following three lemmas. 

Lemma 3. Let $\eta^{(k)}, \eta_1^{(k)}, \dots, \eta_n^{(k)}$ be i.i.d. random variables with $\mathbb{E}\eta^{(k)} = 0$ and
$\|\eta^{(k)}\|_{\psi_1} < +\infty$, $k = 1, \dots, N$. There exists a numerical constant $C > 0$ such that for all
$A \ge 1$ with probability at least $1 - N^{-A}$, for all $k = 1, \dots, N$,

$$\Bigl|n^{-1}\sum_{j=1}^n \eta_j^{(k)}\Bigr| \le C\|\eta^{(k)}\|_{\psi_1}\Bigl(\sqrt{\frac{A\log N}{n}} \vee \frac{A\log N}{n}\Bigr).$$

This is a consequence of a well-known version of Bernstein's inequality (see, for example,
Lemma 2.2.11 in van der Vaart and Wellner (1996)).

Lemma 4. There exists a constant $C > 0$ such that for all $A \ge 1$ with probability at least
$1 - N^{-A}$,

$$\sup_{\|u\|_{\ell_1} \le 1}|(\Pi_n - \Pi)(|f_u|)| \le C\max_{1 \le k \le N}\|h_k(X)\|_{\psi_1}\Bigl(\sqrt{\frac{A\log N}{n}} \vee \frac{A\log N}{n}\Bigr).$$
Proof. Let $R_n(f)$ denote the Rademacher process

$$R_n(f) := n^{-1}\sum_{j=1}^n \varepsilon_j f(X_j),$$

$\varepsilon, \varepsilon_j$, $j = 1, \dots, n$, being i.i.d. Rademacher random variables independent of $X_1, \dots, X_n$.
For $t > 0$, we use the symmetrization inequality and then the contraction inequality (see
Ledoux and Talagrand (1991), page 112) to get

$$\mathbb{E}\exp\Bigl\{t\sup_{\|u\|_{\ell_1} \le 1}|(\Pi_n - \Pi)(|f_u|)|\Bigr\} \le \mathbb{E}\exp\Bigl\{2t\sup_{\|u\|_{\ell_1} \le 1}|R_n(|f_u|)|\Bigr\} \le \mathbb{E}\exp\Bigl\{4t\sup_{\|u\|_{\ell_1} \le 1}|R_n(f_u)|\Bigr\}.$$

Since the mapping $u \mapsto R_n(f_u)$ is linear, the supremum of $R_n(f_u)$ over the set $\{\|u\|_{\ell_1} \le 1\}$
is attained at one of its vertices and we get

$$\begin{aligned}
\mathbb{E}\exp\Bigl\{t\sup_{\|u\|_{\ell_1} \le 1}|(\Pi_n - \Pi)(|f_u|)|\Bigr\}
&\le \mathbb{E}\exp\Bigl\{4t\max_{1 \le k \le N}|R_n(h_k)|\Bigr\} \\
&\le N\max_{1 \le k \le N}\mathbb{E}\bigl[\exp\{4tR_n(h_k)\} \vee \exp\{-4tR_n(h_k)\}\bigr] \\
&\le 2N\max_{1 \le k \le N}\mathbb{E}\exp\{4tR_n(h_k)\} \\
&\le 2N\max_{1 \le k \le N}\Bigl(\mathbb{E}\exp\Bigl\{\frac{4t}{n}\varepsilon h_k(X)\Bigr\}\Bigr)^n.
\end{aligned}$$

To bound the last expectation and to complete the proof, we need only to follow the
standard proof of the Bernstein inequality. □
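The vertex step of this proof is the elementary fact that a linear functional on the $\ell_1$ unit ball attains its largest absolute value at a signed coordinate vector. The following sketch checks this numerically; the vector `a` is a hypothetical stand-in for $(R_n(h_1), \dots, R_n(h_N))$, and the random sampling is only a sanity check, not a proof.

```python
import random

def sup_over_l1_ball(a):
    """sup of |<a, u>| over {||u||_1 <= 1}; attained at a signed basis vector."""
    return max(abs(x) for x in a)

random.seed(0)
a = [random.gauss(0.0, 1.0) for _ in range(8)]  # stands in for (R_n(h_1), ..., R_n(h_N))
best = sup_over_l1_ball(a)

# Sample points on the l1 unit sphere and record the largest excess over `best`.
worst_violation = float("-inf")
for _ in range(2000):
    raw = [random.uniform(-1.0, 1.0) for _ in a]
    scale = sum(abs(x) for x in raw)
    u = [x / scale for x in raw]
    value = abs(sum(x * y for x, y in zip(a, u)))
    worst_violation = max(worst_violation, value - best)

# The value at the best vertex e_k equals max_k |a_k|.
k_star = max(range(len(a)), key=lambda k: abs(a[k]))
vertex_value = abs(a[k_star])
```

No sampled point exceeds `best`, and `vertex_value` equals it, which is why the supremum over $u$ can be replaced by a maximum over the $N$ dictionary elements.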

The proof of the next lemma is quite similar. 

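Lemma 5. There exists a constant $C > 0$ such that for all $A \ge 1$, with probability at
least $1 - N^{-A}$,

$$\sup_{\|u\|_{\ell_1} \le 1}|(\Pi_n - \Pi)(|f_u|^2)| \le C\max_{1 \le k, j \le N}\|h_k(X)h_j(X)\|_{\psi_1}\Bigl(\sqrt{\frac{A\log N}{n}} \vee \frac{A\log N}{n}\Bigr).$$

Let $J \subset \{1, \dots, N\}$, $d(J) \le d \le N/e - 1$. Define

$$K_J := C_J \cap \{u \in \mathbb{R}^N : \|u\|_{\ell_2} \le 1\}.$$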
Lemma 6. There exists a constant $C > 0$ such that for all $A \ge 1$, with probability at
least $1 - 5^{-dA}$,

$$\sup_{u \in K_J}|(\Pi_n - \Pi)(|f_u|)| \le C\sup_{\|u\|_{\ell_2} \le 1,\ d(u) \le d}\|f_u\|_{\psi_1}\Bigl(\sqrt{\frac{A\,d\log(N/d)}{n}} \vee \frac{A\,d\log(N/d)}{n}\Bigr).$$

Proof. The idea of the proof is well known (see, for example, Ledoux and Talagrand
(1991), page 421, or, in a context closer to the current paper, Mendelson, Pajor and
Tomczak-Jaegermann (2007), Lemma 3.3). Recall Lemma 1 and its notation. Let $u \in K_J$.
Lemma 1 implies that

$$K_J \subset 3\,\mathrm{conv}\Bigl(\bigcup\{B_I : I \subset \{1, \dots, N\},\ d(I) \le d\}\Bigr),$$

where

$$B_I := \Bigl\{(u_i : i \in I) : \sum_{i \in I}|u_i|^2 \le 1\Bigr\},$$

since $u^{(0)} \in B_{J_0}$, $u^{(1)} \in B_{J_1}$ and

$$\sum_{k \ge 2}u^{(k)} \in \mathrm{conv}\Bigl(\bigcup\{B_I : I \subset \{1, \dots, N\},\ d(I) \le d\}\Bigr).$$

It is easy to see that if $B$ is the unit Euclidean ball in $\mathbb{R}^d$ and $M$ is a 1/2-net of this ball,
then

$$B \subset 2\,\mathrm{conv}(M).$$

A somewhat informal version of the proof of this claim that can easily be made precise
is as follows: with $+$ denoting Minkowski sum,

$$\begin{aligned}
B &\subset M + \tfrac{1}{2}B \subset \mathrm{conv}(M) + \tfrac{1}{2}B \subset \mathrm{conv}(M) + \tfrac{1}{2}\mathrm{conv}(M) + \tfrac{1}{4}B \subset \cdots \\
&\subset \mathrm{conv}(M) + \tfrac{1}{2}\mathrm{conv}(M) + \tfrac{1}{4}\mathrm{conv}(M) + \cdots \subset 2\,\mathrm{conv}(M).
\end{aligned}$$

Now selecting for each $I$ with $d(I) \le d$ its minimal 1/2-net $M_I$ yields

$$K_J \subset 6\,\mathrm{conv}\Bigl(\bigcup\{M_I : I \subset \{1, \dots, N\},\ d(I) \le d\}\Bigr) =: 6\,\mathrm{conv}(M_d).$$

Therefore, we can repeat the proof of Lemma 4 and reduce the bounding of

$$\sup_{u \in K_J}|(\Pi_n - \Pi)(|f_u|)|$$

to the bounding of

$$\sup_{u \in M_d}|R_n(f_u)|,$$

with $\mathrm{card}(M_d)$ playing the role of $N$. It remains to observe that

$$\mathrm{card}(M_d) \le 5^d\binom{N}{d},$$

which implies that with some $c > 0$,

$$\log(\mathrm{card}(M_d)) \le c\,d\log\frac{N}{d},$$

and the result follows. □
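The final counting step can be sanity-checked numerically. The sketch below assumes the cardinality bound takes the standard form $5^d\binom{N}{d}$ (a 1/2-net of size at most $5^d$ for each of the $\binom{N}{d}$ coordinate subsets) and uses the explicit, non-optimized constant $c = 2 + \log 5$, which follows from $\binom{N}{d} \le (eN/d)^d$ together with $\log(N/d) \ge 1$ when $d \le N/e - 1$. The function names and the grid of $(N, d)$ values are illustrative choices.

```python
import math

def log_card_bound(N, d):
    """log of 5^d * C(N, d), the assumed bound on the cardinality of the union of nets."""
    return d * math.log(5.0) + math.log(math.comb(N, d))

def entropy_bound(N, d, c=2.0 + math.log(5.0)):
    """c * d * log(N / d) with the explicit constant c = 2 + log 5."""
    return c * d * math.log(N / d)

checks = []
for N in (10, 100, 1000, 5000):
    # range stops at int(N/e) - 1, so every d satisfies d <= N/e - 1 and log(N/d) > 1.
    for d in range(1, int(N / math.e)):
        checks.append(log_card_bound(N, d) <= entropy_bound(N, d))
all_ok = all(checks)
```

On this grid `all_ok` is true, consistent with $\log(\mathrm{card}(M_d)) \le c\,d\log(N/d)$ in the range $d \le N/e - 1$ used in the lemma.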